
Conversation

@wking
Member

@wking commented Mar 6, 2023

When changing channels, it's possible that multiple new conditional update risks will need to be evaluated. For instance, a cluster running 4.10.34 in a 4.10 channel might only have to evaluate OpenStackNodeCreationFails. But when the channel is changed to a 4.11 channel, multiple new risks require evaluation, and the evaluation of new risks is throttled at one every 10 minutes. This means that if there are three new risks, it may take up to 20 minutes after the channel has changed for the full set of conditional updates to be computed.

With this commit, I'm sorting the conditional updates in version-descending order, which is the order we've used in the ClusterVersion status since c9dd479 (#663). This prioritizes the longest-hop risks. For example, 4.10.34 currently has the following updates:

  • 4.10.(z!=38): no risks
  • 4.10.38: OpenStackNodeCreationFails
  • 4.11.(z<10): ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn, OVNNetworkPolicyLongName
  • 4.11.(10<=z<26): ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn
  • 4.11.26: ARM64SecCompError524, AWSOldBootImagesLackAfterburn
  • 4.11.(27<=z<...): AWSOldBootImagesLackAfterburn

By focusing on the largest target (say 4.11.30), we'd evaluate AWSOldBootImagesLackAfterburn first. If it did not match the current cluster, 4.11.27 and later would be quickly recommended. It would take another 10m before the self-throttling allowed us to evaluate ARM64SecCompError524, and once we had, that would unblock 4.11.26. Ten minutes after that, we'd evaluate MachineConfigRenderingChurn, and unblock 4.11.(10<=z<26). And so on. But folks on 4.10.34 today are much more likely to be interested in 4.11.30 and other tip releases than they are to care about 4.11.10 and other relatively old releases.
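
For illustration, here is a rough Go sketch of the new ordering (a standalone example, not the PR's actual diff; the conditionalTarget type is a simplified stand-in for the real conditional-update type, and blang/semver is used here just for version parsing):

  package main

  import (
      "fmt"
      "sort"

      "github.com/blang/semver/v4"
  )

  // conditionalTarget is a simplified stand-in for a conditional update:
  // a target version plus the names of the risks gating it.
  type conditionalTarget struct {
      Version string
      Risks   []string
  }

  // sortDescending orders targets newest-first, so the risks gating the
  // largest (most interesting) target are evaluated before older ones.
  func sortDescending(targets []conditionalTarget) {
      sort.Slice(targets, func(i, j int) bool {
          vi, errI := semver.Parse(targets[i].Version)
          vj, errJ := semver.Parse(targets[j].Version)
          if errI != nil || errJ != nil {
              return targets[i].Version > targets[j].Version // crude fallback
          }
          return vi.GT(vj)
      })
  }

  func main() {
      targets := []conditionalTarget{
          {Version: "4.11.10", Risks: []string{"ARM64SecCompError524", "AWSOldBootImagesLackAfterburn", "MachineConfigRenderingChurn"}},
          {Version: "4.10.38", Risks: []string{"OpenStackNodeCreationFails"}},
          {Version: "4.11.30", Risks: []string{"AWSOldBootImagesLackAfterburn"}},
      }
      sortDescending(targets)
      for _, t := range targets {
          fmt.Println(t.Version, t.Risks) // 4.11.30 first, 4.10.38 last
      }
  }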

pkg/cvo/availableupdates: Prioritize conditional risks for largest target version

When changing channels it's possible that multiple new conditional
update risks will need to be evaluated.  For instance, a cluster
running 4.10.34 in a 4.10 channel might only have to evaluate
OpenStackNodeCreationFails.  But when the channel is changed to a 4.11
channel multiple new risks require evaluation and the evaluation of
new risks was throttled at one every 10 minutes.  This means if there
are three new risks it may take up to 20 minutes after the channel has
changed for the full set of conditional updates to be computed.

With this commit, I'm sorting the conditional updates in
version-descending order, which is the order we've used in the
ClusterVersion status since c9dd479 (pkg/cvo/availableupdates:
Sort (conditional)updates, 2021-09-29, openshift#663).  This prioritizes the
longest-hop risks.  For example, 4.10.34 currently has the following
updates:

* 4.10.(z!=38): no risks
* 4.10.38: OpenStackNodeCreationFails
* 4.11.(z<10): ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn, OVNNetworkPolicyLongName
* 4.11.(10<=z<26): ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn
* 4.11.26: ARM64SecCompError524, AWSOldBootImagesLackAfterburn
* 4.11.(27<=z<...): AWSOldBootImagesLackAfterburn

By focusing on the largest target (say 4.11.30), we'd evaluate
AWSOldBootImagesLackAfterburn first.  If it did not match the current
cluster, 4.11.27 and later would be quickly recommended.  It would
take another 10m before the self-throttling allowed us to evaluate
ARM64SecCompError524, and once we had, that would unblock 4.11.26.
Ten minutes after that, we'd evaluate MachineConfigRenderingChurn, and
unblock 4.11.(10<=z<26).  And so on.  But folks on 4.10.34 today are
much more likely to be interested in 4.11.30 and other tip releases
than they are to care about 4.11.10 and other relatively old releases.
@wking changed the title from "pkg/cvo/availableupdates: Prioritize conditional risks for largest target version" to "OCPBUGS-5469: pkg/cvo/availableupdates: Prioritize conditional risks for largest target version" Mar 6, 2023
@openshift-ci-robot added the jira/valid-reference (indicates that this PR references a valid Jira ticket of any type), jira/valid-bug (indicates that a referenced Jira bug is valid for the branch this PR is targeting), and bugzilla/valid-bug (indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) labels Mar 6, 2023
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-5469, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @shellyyang1989

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Mar 6, 2023
@wking
Member Author

wking commented Mar 6, 2023

Testing by sending launch openshift/cluster-version-operator#909 aws to cluster-bot:

$ oc get -o json clusterversion version | jq .status.desired
{
  "image": "registry.build05.ci.openshift.org/ci-ln-cw3q32t/release@sha256:d48cd7d4393fed0a58e8a2a44ac288127ca9ca84d4e2fe871e95ee54bfa9db69",
  "version": "4.13.0-0.ci.test-2023-03-06-135014-ci-ln-cw3q32t-latest"
}

Slotting those into a test branch, and pointing my test cluster at that data:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "https://raw.githubusercontent.com/wking/cincinnati-graph-data/test-promql-prioritization/cincinnati-graph.json"}]'
$ oc adm upgrade channel test

Initial results look good, with 4.14.3 evaluated, 4.14.2's risk failed 13µs later, and 4.14.1's risk failed 23µs later:

$ oc adm upgrade --include-not-recommended
Cluster version is 4.13.0-0.ci.test-2023-03-06-135014-ci-ln-cw3q32t-latest

Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/test-promql-prioritization/cincinnati-graph.json
Channel: test

Recommended updates:

  VERSION     IMAGE
  4.14.3      quay.io/openshift-release-dev/ocp-release@sha256:3333333333333333333333333333333333333333333333333333333333333333

Supported but not recommended updates:

  Version: 4.14.2
  Image: quay.io/openshift-release-dev/ocp-release@sha256:2222222222222222222222222222222222222222222222222222222222222222
  Recommended: Unknown
  Reason: EvaluationFailed
  Message: Exposure to Middle is unknown due to an evaluation failure: client-side throttling: only 12.99µs has elapsed since the last match call completed for this cluster condition backend; this cached cluster condition request has been queued for later execution
  Risk to 4.14.2 https://example.com/middle

  Version: 4.14.1
  Image: quay.io/openshift-release-dev/ocp-release@sha256:1111111111111111111111111111111111111111111111111111111111111111
  Recommended: Unknown
  Reason: EvaluationFailed
  Message: Exposure to Oldest is unknown due to an evaluation failure: client-side throttling: only 22.79µs has elapsed since the last match call completed for this cluster condition backend; this cached cluster condition request has been queued for later execution
  Risk to 4.14.1 https://example.com/oldest

@LalatenduMohanty
Member

/retest

@LalatenduMohanty
Member

By focusing on the largest target (say 4.11.30), we'd evaluate AWSOldBootImagesLackAfterburn first. If it did not match the current cluster, 4.11.27 and later would be quickly recommended.

You meant 4.11.30 instead of 4.11.27, right?

Member

@LalatenduMohanty left a comment


/lgtm

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Mar 10, 2023
@openshift-ci
Contributor

openshift-ci bot commented Mar 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [LalatenduMohanty,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented Mar 10, 2023

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                    Commit   Details  Required  Rerun command
ci/prow/e2e-agnostic-upgrade-out-of-change   530a509  link     true      /test e2e-agnostic-upgrade-out-of-change
ci/prow/e2e-agnostic-upgrade-into-change     530a509  link     true      /test e2e-agnostic-upgrade-into-change

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 4f130ae and 2 for PR HEAD 530a509 in total

@openshift-merge-robot merged commit d0a6799 into openshift:master Mar 11, 2023
@openshift-ci-robot
Contributor

@wking: Jira Issue OCPBUGS-5469: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-5469 has been moved to the MODIFIED state.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking deleted the prioritize-promql-for-tip-targets branch March 11, 2023 04:35
@wking
Member Author

wking commented Mar 11, 2023

By focusing on the largest target (say 4.11.30), we'd evaluate AWSOldBootImagesLackAfterburn first. If it did not match the current cluster, 4.11.27 and later would be quickly recommended.

You meant 4.11.30 instead of 4.11.27, right?

All of 4.11.(27<=z<...) are currently in the same bucket vs. 4.10.34:

openshift/cincinnati-graph-data $ hack/show-edges.py --cincinnati https://api.openshift.com/api/upgrades_info/graph candidate-4.11 | grep '^4[.]10[.]34 -.* 4[.]11[.][0-9][0-9]'
4.10.34 -(blocked: ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn)-> 4.11.12
4.10.34 -(blocked: ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn)-> 4.11.13
4.10.34 -(blocked: ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn)-> 4.11.16
4.10.34 -(blocked: ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn)-> 4.11.17
4.10.34 -(blocked: ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn)-> 4.11.18
4.10.34 -(blocked: ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn)-> 4.11.20
4.10.34 -(blocked: ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn)-> 4.11.21
4.10.34 -(blocked: ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn)-> 4.11.22
4.10.34 -(blocked: ARM64SecCompError524, AWSOldBootImagesLackAfterburn, DuplicateRelease, MachineConfigRenderingChurn)-> 4.11.23
4.10.34 -(blocked: ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn)-> 4.11.24
4.10.34 -(blocked: ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn)-> 4.11.25
4.10.34 -(blocked: ARM64SecCompError524, AWSOldBootImagesLackAfterburn, LeakedMachineConfigBlocksMCO)-> 4.11.26
4.10.34 -(blocked: AWSOldBootImagesLackAfterburn, LeakedMachineConfigBlocksMCO)-> 4.11.27
4.10.34 -(blocked: AWSOldBootImagesLackAfterburn, LeakedMachineConfigBlocksMCO)-> 4.11.28
4.10.34 -(blocked: AWSOldBootImagesLackAfterburn, LeakedMachineConfigBlocksMCO)-> 4.11.29
4.10.34 -(blocked: AWSOldBootImagesLackAfterburn, LeakedMachineConfigBlocksMCO)-> 4.11.30
4.10.34 -(blocked: AWSOldBootImagesLackAfterburn, LeakedMachineConfigBlocksMCO)-> 4.11.31

So evaluating risks for any of 4.11.27, 4.11.28, 4.11.29, 4.11.30, or 4.11.31 will provide enough data to immediately evaluate the others in the 4.11.(27<=z<...) bucket. At least until we find more regressions.
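
A small standalone sketch of why one evaluation covers the whole bucket, under the assumption that risk results are cached by risk name (the real engine caches per cluster-condition query; the helper below is purely illustrative):

  package main

  import "fmt"

  // riskCache maps a risk name to its cached evaluation result:
  // true means the risk matches this cluster.
  type riskCache map[string]bool

  // status reports whether a target is recommendable from cached results
  // alone; evaluated is false while any of its risks are still pending.
  func status(cache riskCache, risks []string) (recommendable, evaluated bool) {
      for _, r := range risks {
          matched, ok := cache[r]
          if !ok {
              return false, false // still throttled / not yet evaluated
          }
          if matched {
              return false, true // a known risk applies to this cluster
          }
      }
      return true, true
  }

  func main() {
      // Evaluating the two shared risks once...
      cache := riskCache{
          "AWSOldBootImagesLackAfterburn": false,
          "LeakedMachineConfigBlocksMCO":  false,
      }
      // ...answers the question for every target in the 4.11.(27<=z<...) bucket.
      for _, target := range []string{"4.11.27", "4.11.28", "4.11.29", "4.11.30", "4.11.31"} {
          ok, evaluated := status(cache, []string{"AWSOldBootImagesLackAfterburn", "LeakedMachineConfigBlocksMCO"})
          fmt.Println(target, "recommendable:", ok, "evaluated:", evaluated)
      }
  }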

@wking
Member Author

wking commented Mar 14, 2023

/cherrypick release-4.13

@openshift-cherrypick-robot

@wking: new pull request created: #912


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 7, 2023
…tion

530a509 (pkg/cvo/availableupdates: Prioritize conditional risks for
largest target version, 2023-03-06, openshift#909) prioritized the order in
which risks were evaluated.  But we were still waiting 10 minutes
between different PromQL evaluations while evaluating conditional
update risks.  This commit sets up callbacks so that when we get
throttled, the condition engine can come back later and let the
available-update goroutine know an evaluation is available, without
having to wait for the available-update goroutine to come back around
with another ~minutes-delayed polling loop.

I'm also adding MinBetweenMatchesWhenCached, allowing faster initial
cache-warming while maintaining the same long-term load on the backing
match engine, so it doesn't take 10 minutes to get that initial "I've
figured it out now!" callback.  I'm preserving 10-minute throttling
between match calls when the cache is already warm.  But when the
cache is warming up, we allow a match call every second, for much
faster cache warming.
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 19, 2023
Instead of waiting for the next round of evaluation, which might take
minutes.  For example, in 4.14.0-rc.1 testing [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep availableupdate
  I0915 18:21:24.184272       1 availableupdates.go:50] First attempt to retrieve available updates
  I0915 18:21:24.845512       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.836566       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.843398       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:54.835464       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.769850       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.784421       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.842269       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.862590       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:24:09.837669       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T18:21:24Z
  I0915 18:24:24.843569       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  I0915 18:25:24.839869       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  ...
  I0915 20:26:07.109093       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:22:23Z
  I0915 20:29:50.769739       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:26:07Z
  I0915 20:33:34.432215       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:29:50Z
  I0915 20:37:18.093261       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:33:34Z

I'm not entirely clear on what the triggers were there, with 3m44s
between those final entries.  Operator.Run sets up:

  wait.UntilWithContext(runContext, func(runContext context.Context) {
    optr.worker(runContext, optr.availableUpdatesQueue, optr.availableUpdatesSync)
  }, time.Second)

and [2] docs UntilWithContext:

  UntilWithContext loops until context is done, running f every period.

  UntilWithContext is syntactic sugar on top of JitterUntilWithContext
  with zero jitter factor and with sliding = true (which means the
  timer for period starts after the f completes).

So that should be waking up, draining the queue, sleeping a second,
waking back up, draining the queue again, and on forever.  Perhaps we
are just backing off to the slowest DefaultControllerRateLimiter
period [3], but I expect processNextWorkItem's calling handleErr is
calling Forget on the queue, because I don't see any of its
error-branch logging:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep 'Error handling\|out of the queue'
  ...no hits...

That suggests nothing is slowing down our queue processing from
once-per-second (plus evaluation time).  But what's feeding the queue
items to process?  The only Add calls seem to be in
clusterVersionEventHandler, but checking audit logs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h clusterversion kube-apiserver/*audit*.log.gz | jq -r 'select(.verb != "get" and .verb != "list" and .verb != "watch") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .verb + " " + .objectRef.subresource + " " + .user.username' | sort
  ...
  2023-09-15T18:26:24.841812Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:26:24.858507Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:29:39.835307Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:37:39.836698Z 200 update status system:serviceaccount:openshift-cluster-version:default

which are all hours before these 20:26 and similar update retrievals.
A bit of a mystery.

Anyhow, my goal is to set the stage for faster cache-warming after
receiving a batch of new PromQL update risks, as described in
530a509 (pkg/cvo/availableupdates: Prioritize conditional risks for
largest target version, 2023-03-06, openshift#909).  I still have not adjusted
the caching logic, so at the moment, it only gives us faster updates
on the "that PromQL is still throttled" loop.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472
[2]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#UntilWithContext
[3]: https://pkg.go.dev/k8s.io/client-go/util/workqueue#DefaultControllerRateLimiter
[4]: https://github.com/kubernetes/client-go/blob/v0.28.2/util/workqueue/default_rate_limiters.go#L39
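
For readers less familiar with the client-go pattern being discussed, here is a generic worker-loop sketch (not the CVO's code) showing the Forget-on-success behavior that keeps DefaultControllerRateLimiter backoff from piling up:

  package main

  import (
      "context"
      "fmt"
      "time"

      "k8s.io/apimachinery/pkg/util/wait"
      "k8s.io/client-go/util/workqueue"
  )

  // sync stands in for the real availableUpdatesSync handler.
  func sync(key string) error {
      fmt.Println("syncing", key)
      return nil
  }

  func processNextWorkItem(queue workqueue.RateLimitingInterface) bool {
      key, quit := queue.Get()
      if quit {
          return false
      }
      defer queue.Done(key)
      if err := sync(key.(string)); err != nil {
          // Errors requeue with the rate limiter's growing backoff.
          queue.AddRateLimited(key)
          return true
      }
      // Success calls Forget, clearing the item's exponential-backoff
      // history in DefaultControllerRateLimiter.
      queue.Forget(key)
      return true
  }

  func main() {
      queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
      queue.Add("cluster-version")
      go func() {
          time.Sleep(2 * time.Second)
          queue.ShutDown()
      }()
      ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
      defer cancel()
      // Mirrors the Operator.Run wiring quoted above: drain the queue,
      // sleep a second, and repeat until the context is done.
      wait.UntilWithContext(ctx, func(context.Context) {
          for processNextWorkItem(queue) {
          }
      }, time.Second)
  }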
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 19, 2023
530a509 (pkg/cvo/availableupdates: Prioritize conditional risks for
largest target version, 2023-03-06, openshift#909) prioritized the order in
which risks were evaluated.  But we were still waiting 10 minutes
between different PromQL evaluations while evaluating conditional
update risks.  This commit breaks MinBetweenMatches into
MinBetweenMatchesInitial and MinBetweenMatchesWhenCached, allowing
faster initial cache-warming while maintaining the same long-term load
on the backing match engine, so it doesn't take 10 minutes to get that
initial "I've figured it out now!" result.  I'm preserving 10-minute
throttling between match calls when the cache is already warm.  But
when the cache is warming up, we allow a match call every second, for
much faster cache warming.
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 19, 2023
Instead of waiting for the next round of evaluation, which might take
minutes.  For example, in 4.14.0-rc.1 testing [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep availableupdate
  I0915 18:21:24.184272       1 availableupdates.go:50] First attempt to retrieve available updates
  I0915 18:21:24.845512       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.836566       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.843398       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:54.835464       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.769850       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.784421       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.842269       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.862590       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:24:09.837669       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T18:21:24Z
  I0915 18:24:24.843569       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  I0915 18:25:24.839869       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  ...
  I0915 20:26:07.109093       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:22:23Z
  I0915 20:29:50.769739       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:26:07Z
  I0915 20:33:34.432215       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:29:50Z
  I0915 20:37:18.093261       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:33:34Z

I'm not entirely clear on what the triggers were there, with 3m44s
between those final entries.  Operator.Run sets up:

  wait.UntilWithContext(runContext, func(runContext context.Context) {
    optr.worker(runContext, optr.availableUpdatesQueue, optr.availableUpdatesSync)
  }, time.Second)

and [2] docs UntilWithContext:

  UntilWithContext loops until context is done, running f every period.

  UntilWithContext is syntactic sugar on top of JitterUntilWithContext
  with zero jitter factor and with sliding = true (which means the
  timer for period starts after the f completes).

So that should be waking up, draining the queue, sleeping a second,
waking back up, draining the queue again, and on forever.  Perhaps we
are just backing off to the slowest DefaultControllerRateLimiter
period [3], but I expect processNextWorkItem's calling handleErr is
calling Forget on the queue, because I don't see any of its
error-branch logging:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep 'Error handling\|out of the queue'
  ...no hits...

That suggests nothing is slowing down our queue processing from
once-per-second (plus evaluation time).  But what's feeding the queue
items to process?  The only Add calls seem to be in
clusterVersionEventHandler, but checking audit logs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h clusterversion kube-apiserver/*audit*.log.gz | jq -r 'select(.verb != "get" and .verb != "list" and .verb != "watch") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .verb + " " + .objectRef.subresource + " " + .user.username' | sort
  ...
  2023-09-15T18:26:24.841812Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:26:24.858507Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:29:39.835307Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:37:39.836698Z 200 update status system:serviceaccount:openshift-cluster-version:default

which are all hours before these 20:26 and similar update retrievals.
I suspect this is due to resyncPeriod(o.ResyncInterval) being passed
NewFilteredSharedInformerFactory when generating the ClusterVersion
informer, putting a lower bound on the UpdateFunc event-handler
frequency.

My goal is to set the stage for faster cache-warming after receiving a
batch of new PromQL update risks, as described in 530a509
(pkg/cvo/availableupdates: Prioritize conditional risks for largest
target version, 2023-03-06, openshift#909).  I still have not adjusted the
caching logic, so at the moment, it only gives us faster updates on
the "that PromQL is still throttled" loop.  The AddAfter avoids
hot-looping on:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 right now.

and instead gives us:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 around a second from now.
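
In workqueue terms the change is small; a sketch (the key handling is
simplified):

  package sketch

  import (
    "time"

    "k8s.io/client-go/util/workqueue"
  )

  // requeueThrottled sketches the AddAfter pivot: instead of re-adding the
  // key immediately (a hot loop on the throttled check), park it in the
  // queue's delaying layer so the next attempt lands about a second later.
  func requeueThrottled(queue workqueue.DelayingInterface, key string) {
    queue.AddAfter(key, time.Second)
  }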

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472
[2]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#UntilWithContext
[3]: https://pkg.go.dev/k8s.io/client-go/util/workqueue#DefaultControllerRateLimiter
[4]: https://github.com/kubernetes/client-go/blob/v0.28.2/util/workqueue/default_rate_limiters.go#L39
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 19, 2023
530a509 (pkg/cvo/availableupdates: Prioritize conditional risks for
largest target version, 2023-03-06, openshift#909) prioritized the order in
which risks were evaluated.  But we were still waiting 10 minutes
between different PromQL evaluations while evaluating conditional
update risks.  This commit breaks MinBetweenMatches into
MinBetweenMatchesInitial and MinBetweenMatchesWhenCached, allowing
faster initial cache-warming while maintaining the same long-term load
on the backing match engine, so it doesn't take 10 minutes to get that
initial "I've figured it out now!" result.  Throttling between match
calls stays at 10 minutes once the cache is already warm, but while
the cache is warming up we allow a match call every second.
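
A rough sketch of the selection between the two throttles (the helper
and its inputs are invented for illustration; only the two constants
reflect this commit):

  package sketch

  import "time"

  const (
    minBetweenMatchesInitial    = time.Second      // while the cache is still warming
    minBetweenMatchesWhenCached = 10 * time.Minute // once a cached result exists
  )

  // nextMatchAllowed picks the throttle based on whether this expression
  // already has a cached result; lastMatch and hasCachedResult stand in for
  // the real cache state.
  func nextMatchAllowed(lastMatch time.Time, hasCachedResult bool) time.Time {
    if hasCachedResult {
      return lastMatch.Add(minBetweenMatchesWhenCached)
    }
    return lastMatch.Add(minBetweenMatchesInitial)
  }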
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 20, 2023
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 20, 2023
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 20, 2023
530a509 (pkg/cvo/availableupdates: Prioritize conditional risks for
largest target version, 2023-03-06, openshift#909) prioritized the order in
which risks were evaluated.  But we were still waiting 10 minutes
between different PromQL evaluations while evaluating conditional
update risks.  The original 10m requirement is from the enhancement
[1], and was implemented in ca186ed (pkg/clusterconditions/cache:
Add a cache wrapper for client-side throttling, 2021-11-10, openshift#663).
But after discussing with Lala, Scott, and Ben, we feel that
addressing the demonstrated user-experience need for low-latency risk
evaluation [2] is worth reducing the throttling to 1s per expression
evaluation.  We still have MinForCondition set to an hour (the
combined gating is sketched after the timelines below), so with this
commit, a cluster-version operator evaluating three risks will move
from a timeline like:

1. 0s, hear about risks that depend on PromQL A, B, and C.  Evaluate A for the first time.
2. 10m, evaluate B for the first time (MinBetweenMatches after 1).
3. 20m, evaluate C for the first time (MinBetweenMatches after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatches after 3).
5. 1h10m, evaluate B again (MinForCondition after 2 and MinBetweenMatches after 4).
6. 1h20m, evaluate C again (MinForCondition after 3 and MinBetweenMatches after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatches after 6).
8. 2h10m, evaluate B again (MinForCondition after 5 and MinBetweenMatches after 7).
9. 2h20m, evaluate C again (MinForCondition after 6 and MinBetweenMatches after 8).

to a timeline like:

1. 0s, hear about risks that depend on PromQL A, B, and C.  Evaluate A for the first time.
2. 1s, evaluate B for the first time (MinBetweenMatches after 1).
3. 2s, evaluate C for the first time (MinBetweenMatches after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatches after 3).
5. 1h1s, evaluate B again (MinForCondition after 2 and MinBetweenMatches after 4).
6. 1h2s, evaluate C again (MinForCondition after 3 and MinBetweenMatches after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatches after 6).
8. 2h1s, evaluate B again (MinForCondition after 5 and MinBetweenMatches after 7).
9. 2h2s, evaluate C again (MinForCondition after 6 and MinBetweenMatches after 8).
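
The combined gating behind those timelines is roughly the following
sketch (the helper name and its inputs are invented; only the two
durations reflect the actual configuration):

  package sketch

  import "time"

  const (
    minForCondition   = time.Hour   // how long a cached evaluation stays fresh
    minBetweenMatches = time.Second // spacing between PromQL evaluations (was 10m)
  )

  // shouldEvaluate combines the two throttles: cachedAge is the age of this
  // expression's cached result (ignored when cached is false), and
  // sinceLastMatch is the time since any expression was last evaluated.
  func shouldEvaluate(cached bool, cachedAge, sinceLastMatch time.Duration) bool {
    if cached && cachedAge < minForCondition {
      return false // cached answer is still fresh; no need to re-evaluate
    }
    return sinceLastMatch >= minBetweenMatches
  }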

We could deliver faster cache warming while preserving spaced out
refresh evaluation by splitting MinBetweenMatches into a 1s
MinBetweenMatchesInitial and 10m MinBetweenMatchesWhenCached, which
would produce timelines like:

1. 0s, hear about risks that depend on PromQL A, B, and C.  Evaluate A for the first time.
2. 1s, evaluate B for the first time (MinBetweenMatchesInitial after 1).
3. 2s, evaluate C for the first time (MinBetweenMatchesInitial after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatchesWhenCached after 3).
5. 1h10m, evaluate B again (MinForCondition after 2 and MinBetweenMatchesWhenCached after 4).
6. 1h20m, evaluate C again (MinForCondition after 3 and MinBetweenMatchesWhenCached after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatchesWhenCached after 6).
8. 2h10m, evaluate B again (MinForCondition after 5 and MinBetweenMatchesWhenCached after 7).
9. 2h20m, evaluate C again (MinForCondition after 6 and MinBetweenMatchesWhenCached after 8).

but again discussing with Lala, Scott, and Ben, the code complexity to
deliver that distinction does not seem to be worth the protection it
delivers to the PromQL engine.  And really, PromQL engines concerned
about load should harden themselves, including via Retry-After headers
[3] that allow clients to back off gracefully when the service needs
it, instead of relying on clients to guess at the load the service can
handle and back off without insight into actual server capacity.
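
For what it's worth, honoring Retry-After on the client side is only a
few lines; a sketch that handles just the delay-seconds form of the
header:

  package sketch

  import (
    "net/http"
    "strconv"
    "time"
  )

  // retryAfter returns how long the server asked the client to back off, or
  // zero if it did not say.  The HTTP-date form of the header is ignored for
  // brevity.
  func retryAfter(resp *http.Response) time.Duration {
    if seconds, err := strconv.Atoi(resp.Header.Get("Retry-After")); err == nil && seconds > 0 {
      return time.Duration(seconds) * time.Second
    }
    return 0
  }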

[1]: https://github.com/openshift/enhancements/blame/158111ce156aac7fa6063a47c00e129c13033aec/enhancements/update/targeted-update-edge-blocking.md#L323-L325
[2]: https://issues.redhat.com/browse/OCPBUGS-19512
[3]: https://www.rfc-editor.org/rfc/rfc9110#name-retry-after
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 22, 2023
Instead of waiting for the next round of evaluation, which might take
minutes.  For example, in 4.14.0-rc.1 testing [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep availableupdate
  I0915 18:21:24.184272       1 availableupdates.go:50] First attempt to retrieve available updates
  I0915 18:21:24.845512       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.836566       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.843398       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:54.835464       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.769850       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.784421       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.842269       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.862590       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:24:09.837669       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T18:21:24Z
  I0915 18:24:24.843569       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  I0915 18:25:24.839869       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  ...
  I0915 20:26:07.109093       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:22:23Z
  I0915 20:29:50.769739       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:26:07Z
  I0915 20:33:34.432215       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:29:50Z
  I0915 20:37:18.093261       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:33:34Z

I'm not entirely clear on what the triggers were there, with 3m44s
between those final entries.  Operator.Run sets up:

  wait.UntilWithContext(runContext, func(runContext context.Context) {
    optr.worker(runContext, optr.availableUpdatesQueue, optr.availableUpdatesSync)
  }, time.Second)

and [2] docs UntilWithContext:

  UntilWithContext loops until context is done, running f every period.

  UntilWithContext is syntactic sugar on top of JitterUntilWithContext
  with zero jitter factor and with sliding = true (which means the
  timer for period starts after the f completes).

So that should be waking up, draining the queue, sleeping a second,
waking back up, draining the queue again, and so on forever.  Perhaps
we are just backing off to the slowest DefaultControllerRateLimiter
period [3], but I expect processNextWorkItem's handleErr call is
calling Forget on the queue, because I don't see any of its
error-branch logging:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep 'Error handling\|out of the queue'
  ...no hits...

That suggests nothing is slowing down our queue processing from
once-per-second (plus evaluation time).  But what's feeding the queue
items to process?  The only Add calls seem to be in
clusterVersionEventHandler, but checking audit logs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h clusterversion kube-apiserver/*audit*.log.gz | jq -r 'select(.verb != "get" and .verb != "list" and .verb != "watch") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .verb + " " + .objectRef.subresource + " " + .user.username' | sort
  ...
  2023-09-15T18:26:24.841812Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:26:24.858507Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:29:39.835307Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:37:39.836698Z 200 update status system:serviceaccount:openshift-cluster-version:default

which are all hours before these 20:26 and similar update retrievals.
I suspect this is due to resyncPeriod(o.ResyncInterval) being passed to
NewFilteredSharedInformerFactory when generating the ClusterVersion
informer, putting a lower bound on the UpdateFunc event-handler
frequency.

My goal is to set the stage for faster cache-warming after receiving a
batch of new PromQL update risks, as described in 530a509
(pkg/cvo/availableupdates: Prioritize conditional risks for largest
target version, 2023-03-06, openshift#909).  I still have not adjusted the
caching logic, so at the moment, it only gives us faster updates on
the "that PromQL is still throttled" loop.  The AddAfter avoids
hot-looping on:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 right now.

and instead gives us:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 around a second from now.

The new addUpdate avoids injecting the same Recommended=True target
into availableUpdates multiple times while trying to evaluate another
conditional update, now that we have the !needFreshFetch case, where
we recycle the previous structure's data without a fresh Cincinnati
fetch to clear earlier additions.  Without the addUpdate pivot
(sketched after the example below), we get [5]:

  status:
    availableUpdates:
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    ...
    conditionalUpdates:
    - conditions:
      - lastTransitionTime: "2023-09-21T09:29:30Z"
        message: The update is recommended, because none of the conditional update
          risks apply to this cluster.
        reason: AsExpected
        status: "True"
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
        version: 4.15.0-0.nightly-2023-10-19-222222
      risks:
      - matchingRules:
        - promql:
            promql: |-
              cluster_infrastructure_provider{type=~"nonexist"}
              or
              0 * cluster_infrastructure_provider
          type: PromQL
        message: Clusters on nonexist provider, this imaginary bug can happen.
        name: SomeInfrastructureThing
        url: https://bug.example.com/c
     ...
    - conditions:
      - lastTransitionTime: "2023-09-21T09:29:31Z"
        message: |-
          On clusters on default invoker user, this imaginary bug can happen. https://bug.example.com/a

          Could not evaluate exposure to update risk SomeChannelThing (evaluation is throttled until 09:29:32Z)
            SomeChannelThing description: On clusters with the channel set to 'buggy', this imaginary bug can happen.
            SomeChannelThing URL: https://bug.example.com/b
        reason: MultipleReasons
        status: "False"
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:66c753e8b75d172f2a3f7ba13363383a76ecbc7ecdc00f3a423bef4ea8560405
        version: 4.15.0-0.nightly-2023-10-17-000000
      risks:
      - matchingRules:
        - promql:
            promql: cluster_installer
          type: PromQL
        message: On clusters on default invoker user, this imaginary bug can happen.
        name: SomeInvokerThing
        url: https://bug.example.com/a
      - matchingRules:
        - promql:
            promql: |-
              group(cluster_version_available_updates{channel="buggy"})
              or
              0 * group(cluster_version_available_updates{channel!="buggy"})
          type: PromQL
        message: On clusters with the channel set to 'buggy', this imaginary bug can
          happen.
        name: SomeChannelThing
        url: https://bug.example.com/b

lasting until the next Cincinnati fetch cleared out the
availableUpdates redundancy.
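
The dedup itself is small; a sketch with a simplified stand-in for the
real Release type:

  package sketch

  // release stands in for the real Release type; only the image pullspec
  // matters for the dedup.
  type release struct {
    Image   string
    Version string
  }

  // addUpdate sketches the pivot: only append targets that are not already
  // present, so repeated evaluation passes between Cincinnati fetches cannot
  // inflate availableUpdates with duplicates.
  func addUpdate(updates []release, update release) []release {
    for _, existing := range updates {
      if existing.Image == update.Image {
        return updates
      }
    }
    return append(updates, update)
  }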

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472
[2]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#UntilWithContext
[3]: https://pkg.go.dev/k8s.io/client-go/util/workqueue#DefaultControllerRateLimiter
[4]: https://github.com/kubernetes/client-go/blob/v0.28.2/util/workqueue/default_rate_limiters.go#L39
[5]: openshift#939 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 22, 2023
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 22, 2023
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 22, 2023
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 22, 2023
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 22, 2023
530a509 (pkg/cvo/availableupdates: Prioritize conditional risks for
largest target version, 2023-03-06, openshift#909) prioritized the order in
which risks were evaluated.  But we were still waiting 10 minutes
between different PromQL evaluations while evaluating conditional
update risks.  The original 10m requirement is from the enhancement
[1], and was implemented in ca186ed (pkg/clusterconditions/cache:
Add a cache wrapper for client-side throttling, 2021-11-10, openshift#663).
But discussing with Lala, Scott, and Ben, we feel like the addressing
the demonstrated user experience need of low-latency risk evaluation
[2] is worth reducing the throttling to 1s per expression evaluation.
We still have MinForCondition set to an hour, so with this commit, a
cluster-version operator evaluating three risks will move from a
timeline like:

1. 0s, hear about risks that depend on PromQL A, B, and C.  Evaluate A for the first time.
2. 10m, evaluate B for the first time (MinBetweenMatches after 1).
3. 20m, evaluate C for the first time (MinBetweenMatches after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatches after 3).
5. 1h10m, evaluate B again (MinForCondition after 2 and MinBetweenMatches after 4).
6. 1h20m, evaluate C again (MinForCondition after 3 and MinBetweenMatches after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatches after 6).
8. 2h10m, evaluate B again (MinForCondition after 5 and MinBetweenMatches after 7).
9. 2h20m, evaluate C again (MinForCondition after 6 and MinBetweenMatches after 8).

to a timeline like:

1. 0s, hear about risks that depend on PromQL A, B, and C.  Evaluate A for the first time.
2. 1s, evaluate B for the first time (MinBetweenMatches after 1).
3. 2s, evaluate C for the first time (MinBetweenMatches after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatches after 3).
5. 1h1s, evaluate B again (MinForCondition after 2 and MinBetweenMatches after 4).
6. 1h2s, evaluate C again (MinForCondition after 3 and MinBetweenMatches after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatches after 6).
8. 2h1s, evaluate B again (MinForCondition after 5 and MinBetweenMatches after 7).
9. 2h2s, evaluate C again (MinForCondition after 6 and MinBetweenMatches after 8).

We could deliver faster cache warming while preserving spaced out
refresh evaluation by splitting MinBetweenMatches into a 1s
MinBetweenMatchesInitial and 10m MinBetweenMatchesWhenCached, which
would produce timelines like:

1. 0s, hear about risks that depend on PromQL A, B, and C.  Evaluate A for the first time.
2. 1s, evaluate B for the first time (MinBetweenMatchesInitial after 1).
3. 2s, evaluate C for the first time (MinBetweenMatchesInitial after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatchesWhenCached after 3).
5. 1h10m, evaluate B again (MinForCondition after 2 and MinBetweenMatchesWhenCached after 4).
6. 1h20m, evaluate C again (MinForCondition after 3 and MinBetweenMatchesWhenCached after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatchesWhenCached after 6).
8. 2h10m, evaluate B again (MinForCondition after 5 and MinBetweenMatchesWhenCached after 7).
9. 2h20m, evaluate C again (MinForCondition after 6 and MinBetweenMatchesWhenCached after 8).

but again discussing with Lala, Scott, and Ben, the code complexity to
deliver that distinction does not seem to be worth the protection it
delivers to the PromQL engine.  And really, PromQL engines concerned
about load should harden themselves, including via Retry-After headers
[3] that let clients back off gracefully when the service needs that,
instead of relying on clients to guess how much load the service can
handle and back off without insight into actual server capacity.
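
As a hedged sketch of that client side (not code from this
repository), honoring a Retry-After header in its seconds form might
look like the following; the query URL below is a placeholder:

  // Sketch only: honor Retry-After (seconds form) instead of guessing at backoff.
  package main

  import (
    "fmt"
    "net/http"
    "strconv"
    "time"
  )

  // retryAfter returns how long the server asked us to wait, or 0 if it did not.
  func retryAfter(resp *http.Response) time.Duration {
    if resp.StatusCode != http.StatusTooManyRequests && resp.StatusCode != http.StatusServiceUnavailable {
      return 0
    }
    if seconds, err := strconv.Atoi(resp.Header.Get("Retry-After")); err == nil {
      return time.Duration(seconds) * time.Second
    }
    return 0
  }

  func main() {
    const queryURL = "https://prometheus.example.com/api/v1/query?query=cluster_installer" // placeholder
    for attempt := 0; attempt < 3; attempt++ {
      resp, err := http.Get(queryURL)
      if err != nil {
        fmt.Println("request failed:", err)
        return
      }
      resp.Body.Close()
      if wait := retryAfter(resp); wait > 0 {
        fmt.Println("server asked us to back off for", wait)
        time.Sleep(wait)
        continue
      }
      fmt.Println("query returned", resp.Status)
      return
    }
  }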

[1]: https://github.com/openshift/enhancements/blame/158111ce156aac7fa6063a47c00e129c13033aec/enhancements/update/targeted-update-edge-blocking.md#L323-L325
[2]: https://issues.redhat.com/browse/OCPBUGS-19512
[3]: https://www.rfc-editor.org/rfc/rfc9110#name-retry-after
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 25, 2023
Instead of waiting for the next round of evaluation, which might take
minutes.  For example, in 4.14.0-rc.1 testing [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep availableupdate
  I0915 18:21:24.184272       1 availableupdates.go:50] First attempt to retrieve available updates
  I0915 18:21:24.845512       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.836566       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.843398       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:54.835464       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.769850       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.784421       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.842269       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.862590       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:24:09.837669       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T18:21:24Z
  I0915 18:24:24.843569       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  I0915 18:25:24.839869       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  ...
  I0915 20:26:07.109093       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:22:23Z
  I0915 20:29:50.769739       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:26:07Z
  I0915 20:33:34.432215       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:29:50Z
  I0915 20:37:18.093261       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:33:34Z

I'm not entirely clear on what the triggers were there, with 3m44s
between those final entries.  Operator.Run sets up:

  wait.UntilWithContext(runContext, func(runContext context.Context) {
    optr.worker(runContext, optr.availableUpdatesQueue, optr.availableUpdatesSync)
  }, time.Second)

and the UntilWithContext documentation [2] says:

  UntilWithContext loops until context is done, running f every period.

  UntilWithContext is syntactic sugar on top of JitterUntilWithContext
  with zero jitter factor and with sliding = true (which means the
  timer for period starts after the f completes).

So that should be waking up, draining the queue, sleeping a second,
waking back up, draining the queue again, and so on forever.  Perhaps
we are just backing off to the slowest DefaultControllerRateLimiter
period [3], but I expect processNextWorkItem's handleErr call is
calling Forget on the queue, because I don't see any of its
error-branch logging:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep 'Error handling\|out of the queue'
  ...no hits...

That suggests nothing is slowing our queue processing down from
once per second (plus evaluation time).  But what's feeding the queue
items to process?  The only Add calls seem to be in
clusterVersionEventHandler, but checking audit logs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h clusterversion kube-apiserver/*audit*.log.gz | jq -r 'select(.verb != "get" and .verb != "list" and .verb != "watch") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .verb + " " + .objectRef.subresource + " " + .user.username' | sort
  ...
  2023-09-15T18:26:24.841812Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:26:24.858507Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:29:39.835307Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:37:39.836698Z 200 update status system:serviceaccount:openshift-cluster-version:default

which are all hours before these 20:26 and similar update retrievals.
I suspect this is due to resyncPeriod(o.ResyncInterval) being passed to
NewFilteredSharedInformerFactory when generating the ClusterVersion
informer, putting a lower bound on the UpdateFunc event-handler
frequency.
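
A hedged sketch of that suspected wiring, using a generic ConfigMap
informer and a fake clientset as stand-ins for the real ClusterVersion
informer, would look roughly like:

  // Sketch only: a non-zero resync period makes the informer replay UpdateFunc
  // on that period even when nothing changed, which is what feeds the queue.
  package main

  import (
    "fmt"
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes/fake"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/util/workqueue"
  )

  func main() {
    client := fake.NewSimpleClientset()
    resync := 3 * time.Minute // stand-in for resyncPeriod(o.ResyncInterval)
    factory := informers.NewSharedInformerFactory(client, resync)
    queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

    factory.Core().V1().ConfigMaps().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
      // With a non-zero resync, UpdateFunc fires at least once per resync
      // period per cached object, putting a floor on how often the queue is fed.
      UpdateFunc: func(oldObj, newObj interface{}) {
        queue.Add("availableUpdates")
        fmt.Println("queued an availableUpdates sync")
      },
    })

    stop := make(chan struct{})
    defer close(stop)
    factory.Start(stop)
    factory.WaitForCacheSync(stop)
    // A real controller would now run a worker draining the queue once per second.
  }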

My goal is to set the stage for faster cache-warming after receiving a
batch of new PromQL update risks, as described in 530a509
(pkg/cvo/availableupdates: Prioritize conditional risks for largest
target version, 2023-03-06, openshift#909).  I still have not adjusted the
caching logic, so at the moment, it only gives us faster updates on
the "that PromQL is still throttled" loop.  The AddAfter avoids
hot-looping on:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 right now.

and instead gives us:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 around a second from now.
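
A minimal sketch of that requeue pattern with the client-go workqueue
(illustrative names, not the actual cluster-version operator worker):

  // Sketch only: AddAfter requeues the key about a second out instead of
  // immediately, so the worker does not hot-loop while evaluation is throttled.
  package main

  import (
    "fmt"
    "time"

    "k8s.io/client-go/util/workqueue"
  )

  // canEvaluateRisks stands in for "has the PromQL throttle expired?".
  func canEvaluateRisks(now, throttledUntil time.Time) bool {
    return !now.Before(throttledUntil)
  }

  func main() {
    queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
    defer queue.ShutDown()
    throttledUntil := time.Now().Add(2 * time.Second)

    queue.Add("availableUpdates")
    for i := 0; i < 5; i++ {
      key, shutdown := queue.Get()
      if shutdown {
        return
      }
      if !canEvaluateRisks(time.Now(), throttledUntil) {
        fmt.Println("still throttled; requeueing in about a second")
        queue.AddAfter(key, time.Second) // not queue.Add(key), which would spin
        queue.Done(key)
        continue
      }
      fmt.Println("evaluating risks for", key)
      queue.Done(key)
      return
    }
  }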

The new addUpdate avoids injecting the same Recommended=True target
into availableUpdates multiple times while trying to evaluate another
conditional update, now that we have the !needFreshFetch case, where
we recycle the previous structure's data without a fresh Cincinnati
fetch to clear earlier additions.  Without the addUpdate pivot, we get
[5]:

  status:
    availableUpdates:
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    ...
    conditionalUpdates:
    - conditions:
      - lastTransitionTime: "2023-09-21T09:29:30Z"
        message: The update is recommended, because none of the conditional update
          risks apply to this cluster.
        reason: AsExpected
        status: "True"
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
        version: 4.15.0-0.nightly-2023-10-19-222222
      risks:
      - matchingRules:
        - promql:
            promql: |-
              cluster_infrastructure_provider{type=~"nonexist"}
              or
              0 * cluster_infrastructure_provider
          type: PromQL
        message: Clusters on nonexist provider, this imaginary bug can happen.
        name: SomeInfrastructureThing
        url: https://bug.example.com/c
     ...
    - conditions:
      - lastTransitionTime: "2023-09-21T09:29:31Z"
        message: |-
          On clusters on default invoker user, this imaginary bug can happen. https://bug.example.com/a

          Could not evaluate exposure to update risk SomeChannelThing (evaluation is throttled until 09:29:32Z)
            SomeChannelThing description: On clusters with the channel set to 'buggy', this imaginary bug can happen.
            SomeChannelThing URL: https://bug.example.com/b
        reason: MultipleReasons
        status: "False"
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:66c753e8b75d172f2a3f7ba13363383a76ecbc7ecdc00f3a423bef4ea8560405
        version: 4.15.0-0.nightly-2023-10-17-000000
      risks:
      - matchingRules:
        - promql:
            promql: cluster_installer
          type: PromQL
        message: On clusters on default invoker user, this imaginary bug can happen.
        name: SomeInvokerThing
        url: https://bug.example.com/a
      - matchingRules:
        - promql:
            promql: |-
              group(cluster_version_available_updates{channel="buggy"})
              or
              0 * group(cluster_version_available_updates{channel!="buggy"})
          type: PromQL
        message: On clusters with the channel set to 'buggy', this imaginary bug can
          happen.
        name: SomeChannelThing
        url: https://bug.example.com/b

lasting until the next Cincinnati fetch cleared out the
availableUpdates redundancy.
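
For illustration, a hedged sketch of the dedup an addUpdate-style
helper has to provide in the !needFreshFetch case; the Release type
and helper below are invented for this example, modeled on the
image/version pairs shown above:

  // Sketch only: append a target unless an entry with the same image is already
  // present, which is what prevents the repeated availableUpdates entries above.
  package main

  import "fmt"

  type Release struct {
    Image   string
    Version string
  }

  func addUpdate(updates []Release, release Release) []Release {
    for _, existing := range updates {
      if existing.Image == release.Image {
        return updates
      }
    }
    return append(updates, release)
  }

  func main() {
    nightly := Release{
      Image:   "registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28",
      Version: "4.15.0-0.nightly-2023-10-19-222222",
    }
    var available []Release
    // Several reconcile passes while another conditional update is still
    // throttled would each have appended the same recommended target before
    // the pivot; now only the first append lands.
    for i := 0; i < 3; i++ {
      available = addUpdate(available, nightly)
    }
    fmt.Println(len(available), "entry in availableUpdates; duplicates were skipped")
  }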

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472
[2]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#UntilWithContext
[3]: https://pkg.go.dev/k8s.io/client-go/util/workqueue#DefaultControllerRateLimiter
[4]: https://github.com/kubernetes/client-go/blob/v0.28.2/util/workqueue/default_rate_limiters.go#L39
[5]: openshift#939 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 25, 2023
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 25, 2023
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 25, 2023
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Sep 27, 2023
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Sep 27, 2023
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 25, 2023
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 25, 2023
530a509 (pkg/cvo/availableupdates: Prioritize conditional risks for
largest target version, 2023-03-06, openshift#909) reordered evaluation to
prioritize the risks blocking the largest target versions.  But we
were still waiting 10 minutes
between different PromQL evaluations while evaluating conditional
update risks.  The original 10m requirement is from the enhancement
[1], and was implemented in ca186ed (pkg/clusterconditions/cache:
Add a cache wrapper for client-side throttling, 2021-11-10, openshift#663).
But after discussing with Lala, Scott, and Ben, we feel that
addressing the demonstrated user-experience need for low-latency risk
evaluation [2] is worth reducing the throttling to 1s per expression
evaluation.  We still have MinForCondition set to an hour (a rough
sketch of this gating follows the second timeline below), so with this
commit, a cluster-version operator evaluating three risks will move
from a timeline like:

1. 0s, hear about risks that depend on PromQL A, B, and C.  Evaluate A for the first time.
2. 10m, evaluate B for the first time (MinBetweenMatches after 1).
3. 20m, evaluate C for the first time (MinBetweenMatches after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatches after 3).
5. 1h10m, evaluate B again (MinForCondition after 2 and MinBetweenMatches after 4).
6. 1h20m, evaluate C again (MinForCondition after 3 and MinBetweenMatches after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatches after 6).
8. 2h10m, evaluate B again (MinForCondition after 5 and MinBetweenMatches after 7).
9. 2h20m, evaluate C again (MinForCondition after 6 and MinBetweenMatches after 8).

to a timeline like:

1. 0s, hear about risks that depend on PromQL A, B, and C.  Evaluate A for the first time.
2. 1s, evaluate B for the first time (MinBetweenMatches after 1).
3. 2s, evaluate C for the first time (MinBetweenMatches after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatches after 3).
5. 1h1s, evaluate B again (MinForCondition after 2 and MinBetweenMatches after 4).
6. 1h2s, evaluate C again (MinForCondition after 3 and MinBetweenMatches after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatches after 6).
8. 2h1s, evaluate B again (MinForCondition after 5 and MinBetweenMatches after 7).
9. 2h2s, evaluate C again (MinForCondition after 6 and MinBetweenMatches after 8).
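
A rough sketch of that gating follows.  Only the MinForCondition and
MinBetweenMatches knobs come from the configuration discussed here;
the package, names, and function are hypothetical, not the
pkg/clusterconditions/cache implementation:

  // Hedged sketch of the client-side throttle: an expression may be
  // evaluated again only after both its own per-condition window and the
  // global spacing between evaluations have elapsed.
  package cache // hypothetical sketch, not the real cache wrapper

  import "time"

  const (
    minForCondition   = time.Hour   // re-evaluate a given expression at most hourly
    minBetweenMatches = time.Second // spacing between any two evaluations (was 10m)
  )

  // nextEligible returns the earliest time this expression may be evaluated,
  // given when it was last evaluated and when any expression was last evaluated.
  func nextEligible(lastForThisExpression, lastForAnyExpression time.Time) time.Time {
    perExpression := lastForThisExpression.Add(minForCondition)
    global := lastForAnyExpression.Add(minBetweenMatches)
    if perExpression.After(global) {
      return perExpression
    }
    return global
  }

With minBetweenMatches down to a second, the only long wait left for an
expression that has already been evaluated is its hourly
minForCondition refresh, which is what produces the 1h, 1h1s, 1h2s
cadence above.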

We could deliver faster cache warming while preserving spaced out
refresh evaluation by splitting MinBetweenMatches into a 1s
MinBetweenMatchesInitial and 10m MinBetweenMatchesWhenCached, which
would produce timelines like:

1. 0s, hear about risks that depend on PromQL A, B, and C.  Evaluate A for the first time.
2. 1s, evaluate B for the first time (MinBetweenMatchesInitial after 1).
3. 2s, evaluate C for the first time (MinBetweenMatchesInitial after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatchesWhenCached after 3).
5. 1h10m, evaluate B again (MinForCondition after 2 and MinBetweenMatchesWhenCached after 4).
6. 1h20m, evaluate C again (MinForCondition after 3 and MinBetweenMatchesWhenCached after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatchesWhenCached after 6).
8. 2h10m, evaluate B again (MinForCondition after 5 and MinBetweenMatchesWhenCached after 7).
9. 2h20m, evaluate C again (MinForCondition after 6 and MinBetweenMatchesWhenCached after 8).

but again discussing with Lala, Scott, and Ben, the code complexity to
deliver that distinction does not seem to be worth the protection it
delivers to the PromQL engine.  And really, PromQL engines concerned
about load should harden themselves, including via Retry-After headers
[3], which allow clients to back off gracefully when the service needs
that, instead of relying on clients to guess how much load the service
can handle and back off without insight into actual server capacity.
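
For illustration only (this is not code from this repository), honoring
Retry-After needs nothing beyond the Go standard library:

  // Hedged sketch: retryAfter reads a Retry-After header (RFC 9110) and
  // returns how long to wait before retrying, handling both the
  // delay-seconds and HTTP-date forms, with a caller-supplied fallback
  // when the header is absent or unparseable.
  package promclient // hypothetical package name

  import (
    "net/http"
    "strconv"
    "time"
  )

  func retryAfter(resp *http.Response, fallback time.Duration) time.Duration {
    header := resp.Header.Get("Retry-After")
    if header == "" {
      return fallback
    }
    if seconds, err := strconv.Atoi(header); err == nil && seconds >= 0 {
      return time.Duration(seconds) * time.Second
    }
    if when, err := http.ParseTime(header); err == nil {
      if wait := time.Until(when); wait > 0 {
        return wait
      }
      return 0
    }
    return fallback
  }

A server under load can then steer client backoff directly instead of
every client guessing at a safe polling interval.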

[1]: https://github.com/openshift/enhancements/blame/158111ce156aac7fa6063a47c00e129c13033aec/enhancements/update/targeted-update-edge-blocking.md#L323-L325
[2]: https://issues.redhat.com/browse/OCPBUGS-19512
[3]: https://www.rfc-editor.org/rfc/rfc9110#name-retry-after

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
