
Conversation

@wking (Member) commented Jun 7, 2023

530a509 (#909) prioritized the order in which risks were evaluated. But we were still waiting 10 minutes between different PromQL evaluations while evaluating conditional update risks. This pull request sets up faster initial cache-warming while maintaining the same long-term load on the backing match engine, so it doesn't take 10 minutes to get that initial "I've figured it out now!" result. I'm preserving 10-minute throttling between match calls when the cache is already warm. But when the cache is warming up, we allow a match call every second, for much faster cache warming. [edit: we dropped the warming-vs-resyncing distinction]
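
To make that throttle split concrete, here is a minimal Go sketch of the behavior as originally described (again, the merged code dropped the warming-vs-resyncing distinction); the MinBetweenMatches* names echo this description, while the surrounding Cache shape is an assumption for illustration:

package promqlthrottle

import "time"

// Cache is an illustrative stand-in for the PromQL condition cache.
type Cache struct {
    MinBetweenMatches           time.Duration // ~1s spacing while the cache is warming
    MinBetweenMatchesWhenCached time.Duration // ~10m per-query throttle once warm

    lastMatch     time.Time            // when any query last hit the match engine
    lastEvaluated map[string]time.Time // per-query last evaluation time
}

// throttledUntil reports when a query may next hit the backing match
// engine: never-evaluated queries wait only for the short warming
// spacing, while already-cached queries keep the long throttle, so the
// steady-state load on the engine is unchanged.
func (c *Cache) throttledUntil(query string) time.Time {
    if last, ok := c.lastEvaluated[query]; ok {
        return last.Add(c.MinBetweenMatchesWhenCached)
    }
    return c.lastMatch.Add(c.MinBetweenMatches)
}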

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 7, 2023
@openshift-ci-robot (Contributor) commented Jun 7, 2023

@wking: This pull request references OTA-942 which is a valid jira issue.


In response to this:

530a509 (#909) prioritized the order in which risks were evaluated. But we were still waiting 10 minutes between different PromQL evaluations while evaluating conditional update risks. This commit sets up callbacks so that when we get throttled, the condition engine can come back later and let the available-update goroutine know an evaluation is available, without having to wait for the available-update goroutine to come back around with another ~minutes-delayed polling loop.

I'm also adding MinBetweenMatchesWhenCached, allowing faster initial cache-warming while maintaining the same long-term load on the backing match engine, so it doesn't take 10 minutes to get that initial "I've figured it out now!" callback. I'm preserving 10-minute throttling between match calls when the cache is already warm. But when the cache is warming up, we allow a match call every second, for much faster cache warming.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 7, 2023
@openshift-ci bot (Contributor) commented Jun 7, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 7, 2023
@openshift-bot (Contributor)

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 6, 2023
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 6, 2023
@wking wking force-pushed the fast-fill-promql-cache branch from fbff2b4 to 873254c on September 19, 2023 05:43
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 19, 2023
@wking wking changed the title OTA-942: pkg/clusterconditions: Include callbacks for initially-failing evaluation OTA-942: pkg/clusterconditions/promql: Warm cache with 1s delay Sep 19, 2023
@wking wking marked this pull request as ready for review September 19, 2023 05:44
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 19, 2023
@openshift-ci-robot (Contributor) commented Sep 19, 2023

@wking: This pull request references OTA-942 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.


In response to this:

530a509 (#909) prioritized the order in which risks were evaluated. But we were still waiting 10 minutes between different PromQL evaluations while evaluating conditional update risks. This pull request sets up faster initial cache-warming while maintaining the same long-term load on the backing match engine, so it doesn't take 10 minutes to get that initial "I've figured it out now!" result. I'm preserving 10-minute throttling between match calls when the cache is already warm. But when the cache is warming up, we allow a match call every second, for much faster cache warming.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking force-pushed the fast-fill-promql-cache branch 7 times, most recently from 2e366cf to 294e0cf on September 19, 2023 10:23
@wking (Member, Author) commented Sep 19, 2023

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 19, 2023
@wking wking force-pushed the fast-fill-promql-cache branch 2 times, most recently from c971ce7 to ae43804 on September 19, 2023 10:44
@wking (Member, Author) commented Sep 19, 2023

Testing with Cluster Bot launch 4.15,openshift/cluster-version-operator#939 gcp:

$ oc get -o json clusterversion version | jq .status.desired
{
  "image": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:ebc665c8f2d59b1eefb6fb802228743d746de42c2f73214b63ae61e0e2d0e20a",
  "version": "4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest"
}

Set up dummy graph-data riffing on OTA-520:

$ git checkout --orphan demo-cache-warming
$ git rm -rf .   # we don't want any of the previous branch's files in this branch
$ cat <<EOF >cincinnati-graph.json      
> {
>   "nodes": [
>     {"version": "4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest", "payload": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:ebc665c8f2d59b1eefb6fb802228743d746de42c2f73214b63ae61e0e2d0e20a"},
>     {"version": "4.15.1", "payload": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"}
>   ],
>   "conditionalEdges": []
> }
> EOF
$ jq . cincinnati-graph.json  # sanity check your syntax as valid JSON to catch missing commas and such
$ git add cincinnati-graph.json
$ git commit -m 'WIP: Static Cincinnati graph to demo cache warming; first with no edges'
$ git push -u wking demo-cache-warming

Configure my cluster to consume that graph:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo-cache-warming/cincinnati-graph.json"}]'
$ oc adm upgrade channel whatever

Check that it's working:

$ oc adm upgrade
Cluster version is 4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest

Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo-cache-warming/cincinnati-graph.json
Channel: whatever
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.

Add a bunch of risks:

$ cat <<EOF >cincinnati-graph.json
> {
>   "nodes": [
>     {"version": "4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest", "payload": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:ebc665c8f2d59b1eefb6fb802228743d746de42c2f73214b63ae61e0e2d0e20a"},
>     {"version": "4.15.1", "payload": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"}
>   ],
>   "conditionalEdges": [
>     {
>       "edges": [{"from": "4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest", "to": "4.15.1"}],
>       "risks": [
>         {"url": "https://bug.example.com/a", "name": "A", "message": "A.", "matchingRules": [{"type": "PromQL", "promql": {"promql": "max(cluster_proxy_enabled{type=~\"https?\"})"}}]},
>         {"url": "https://bug.example.com/b", "name": "B", "message": "B.", "matchingRules": [{"type": "PromQL", "promql": {"promql": "group(cluster_version_available_updates{channel=\"buggy\"})\nor\n0 * group(cluster_version_available_updates{channel!=\"buggy\"})"}}]},
>         {"url": "https://bug.example.com/c", "name": "C", "message": "C.", "matchingRules": [{"type": "PromQL", "promql": {"promql": "group(csv_succeeded{name=~\"local-storage-operator[.].*\"}) or 0 * group(csv_count)"}}]}
>       ]
>     }
>   ]
> }
> EOF
$ jq . cincinnati-graph.json  # sanity check your syntax as valid JSON to catch missing commas and such
$ git add cincinnati-graph.json
$ git commit -m 'WIP: Static Cincinnati graph to demo cache warming; lots of risks'
$ git push wking demo-cache-warming

The risks get evaluated:

$ oc adm upgrade
Cluster version is 4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest

Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo-cache-warming/cincinnati-graph.json
Channel: whatever

Recommended updates:

  VERSION     IMAGE
  4.15.1      registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

Checking on timing:

$ oc get -o json clusterversion version | jq -r '.status.conditionalUpdates[0].conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2023-09-19T20:51:27Z Recommended=True AsExpected: The update is recommended, because none of the conditional update risks apply to this cluster.
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep availableupdates.go
...
I0919 20:46:54.937033       1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T20:42:24Z
I0919 20:49:10.042854       1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T20:46:54Z, will try later.
I0919 20:51:25.148521       1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T20:46:54Z
I0919 20:51:25.325465       1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk B (evaluation is throttled until 20:51:26Z because the query has never been evaluated)
I0919 20:51:25.325975       1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk B (evaluation is throttled until 20:51:26Z because the query has never been evaluated)
...
I0919 20:51:26.324938       1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk B (evaluation is throttled until 20:51:26Z because the query has never been evaluated)
I0919 20:51:26.352599       1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk C (evaluation is throttled until 20:51:27Z because the query has never been evaluated)
I0919 20:51:26.353104       1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk C (evaluation is throttled until 20:51:27Z because the query has never been evaluated)
I0919 20:51:26.353594       1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk C (evaluation is throttled until 20:51:27Z because the query has never been evaluated)
...
I0919 20:51:27.352109       1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk C (evaluation is throttled until 20:51:27Z because the query has never been evaluated)
I0919 20:51:27.747240       1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T20:51:27Z, will try later.
I0919 20:53:40.253893       1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T20:51:27Z, will try later.
I0919 20:55:55.361345       1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T20:51:27Z
I0919 20:58:10.467079       1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T20:55:55Z, will try later.
I0919 21:00:25.573629       1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T20:55:55Z
I0919 21:02:40.680171       1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T21:00:25Z, will try later.
I0919 21:04:55.785728       1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T21:00:25Z
I0919 21:07:10.892253       1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T21:04:55Z, will try later.
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep 'evaluate PromQL cluster condition'
I0919 20:51:25.295589       1 promql.go:119] evaluate PromQL cluster condition: "max(cluster_proxy_enabled{type=~\"https?\"})"
I0919 20:51:26.334624       1 promql.go:119] evaluate PromQL cluster condition: "group(cluster_version_available_updates{channel=\"buggy\"})\nor\n0 * group(cluster_version_available_updates{channel!=\"buggy\"})"
I0919 20:51:27.356257       1 promql.go:119] evaluate PromQL cluster condition: "group(csv_succeeded{name=~\"local-storage-operator[.].*\"}) or 0 * group(csv_count)"

So that is warming all three risks from 20:51:25 to 20:51:27, at the desired one PromQL per second 🎉. It is a bit hot on the internal queue triggers; we can probably use AddAfter or something to cool that off. We'll see if my cluster still has an hour left to get the second round of PromQL refreshes...
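
As a hedged sketch of that AddAfter cool-off (the queue flavor and key are stand-ins, not the CVO's actual wiring):

package requeuesketch

import (
    "time"

    "k8s.io/client-go/util/workqueue"
)

// requeueThrottled retries a still-throttled evaluation about a second
// later instead of re-adding it immediately, so retries line up with
// the one-PromQL-per-second warming budget rather than spinning.
func requeueThrottled(queue workqueue.RateLimitingInterface, key string) {
    queue.AddAfter(key, time.Second)
}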

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 25, 2023
@openshift-ci bot (Contributor) commented Sep 25, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@petr-muller (Member) commented Sep 25, 2023

/test e2e-agnostic-ovn-upgrade-into-change

[sig-node] nodes should not go unready after being upgraded and go unready only once
{  1 nodes violated upgrade expectations: 
 Node ci-op-b4tfk34r-deec6-x75rx-worker-centralus1-lj9tf went unready multiple times: 2023-09-23T07:24:29Z, 2023-09-23T08:06:37Z
 Node ci-op-b4tfk34r-deec6-x75rx-worker-centralus1-lj9tf went ready multiple times: 2023-09-23T07:24:29Z, 2023-09-23T08:07:09Z

Sippy says this flake is common now in 4.14

@petr-muller (Member)

Argh I actually wanted to /override 😬

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 8210fc7 and 2 for PR HEAD f92ff4b in total

@wking (Member, Author) commented Sep 25, 2023

I expect this pick to fail on context conflicts, but asking for it anyway to have an unambiguous motivation for the manual backport:

/cherrypick release-4.14

@openshift-cherrypick-robot

@wking: once the present PR merges, I will cherry-pick it on top of release-4.14 in a new PR and assign it to you.


In response to this:

I expect this pick to fail on context conflicts, but asking for it anyway to have an unambiguous motivation for the manual backport:

/cherrypick release-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci bot (Contributor) commented Sep 25, 2023

@wking: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit ddbf138 into openshift:master Sep 25, 2023
@openshift-ci-robot (Contributor)

@wking: Jira Issue OCPBUGS-19512: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-19512 has been moved to the MODIFIED state.


In response to this:

530a509 (#909) prioritized the order in which risks were evaluated. But we were still waiting 10 minutes between different PromQL evaluations while evaluating conditional update risks. This pull request sets up faster initial cache-warming while maintaining the same long-term load on the backing match engine, so it doesn't take 10 minutes to get that initial "I've figured it out now!" result. I'm preserving 10-minute throttling between match calls when the cache is already warm. But when the cache is warming up, we allow a match call every second, for much faster cache warming. [edit: we dropped the warming-vs-resyncing distinction]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@wking: #939 failed to apply on top of branch "release-4.14":

Applying: pkg/cvo/availableupdates: Return a copy in getAvailableUpdates
Using index info to reconstruct a base tree...
M	pkg/cvo/availableupdates.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/cvo/availableupdates.go
CONFLICT (content): Merge conflict in pkg/cvo/availableupdates.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 pkg/cvo/availableupdates: Return a copy in getAvailableUpdates
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


In response to this:

I expect this pick to fail on context conflicts, but asking for it anyway to have an unambiguous motivation for the manual backport:

/cherrypick release-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the fast-fill-promql-cache branch September 25, 2023 22:50
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 25, 2023
Instead of waiting for the next round of evaluation, which might take
minutes.  For example, in 4.14.0-rc.1 testing [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep availableupdate
  I0915 18:21:24.184272       1 availableupdates.go:50] First attempt to retrieve available updates
  I0915 18:21:24.845512       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.836566       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.843398       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:54.835464       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.769850       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.784421       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.842269       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.862590       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:24:09.837669       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T18:21:24Z
  I0915 18:24:24.843569       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  I0915 18:25:24.839869       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  ...
  I0915 20:26:07.109093       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:22:23Z
  I0915 20:29:50.769739       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:26:07Z
  I0915 20:33:34.432215       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:29:50Z
  I0915 20:37:18.093261       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:33:34Z

I'm not entirely clear on what the triggers were there, with 3m44s
between those final entries.  Operator.Run sets up:

  wait.UntilWithContext(runContext, func(runContext context.Context) {
    optr.worker(runContext, optr.availableUpdatesQueue, optr.availableUpdatesSync)
  }, time.Second)

and [2] docs UntilWithContext:

  UntilWithContext loops until context is done, running f every period.

  UntilWithContext is syntactic sugar on top of JitterUntilWithContext
  with zero jitter factor and with sliding = true (which means the
  timer for period starts after the f completes).

So that should be waking up, draining the queue, sleeping a second,
waking back up, draining the queue again, and so on forever.  Perhaps
we are just backing off to the slowest DefaultControllerRateLimiter
period [3][4], but I expect processNextWorkItem's handleErr call is
invoking Forget on the queue, because I don't see any of its
error-branch logging:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep 'Error handling\|out of the queue'
  ...no hits...

That suggests nothing is slowing down our queue processing from
once-per-second (plus evaluation time).  But what's feeding the queue
items to process?  The only Add calls seem to be in
clusterVersionEventHandler, but checking audit logs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h clusterversion kube-apiserver/*audit*.log.gz | jq -r 'select(.verb != "get" and .verb != "list" and .verb != "watch") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .verb + " " + .objectRef.subresource + " " + .user.username' | sort
  ...
  2023-09-15T18:26:24.841812Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:26:24.858507Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:29:39.835307Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:37:39.836698Z 200 update status system:serviceaccount:openshift-cluster-version:default

which are all hours before these 20:26 and similar update retrievals.
I suspect this is due to resyncPeriod(o.ResyncInterval) being passed to
NewFilteredSharedInformerFactory when generating the ClusterVersion
informer, putting a lower bound on the UpdateFunc event-handler
frequency.
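
For reference, that event-handler wiring looks roughly like the
following hedged sketch (queue and workKey are stand-ins; cache is
k8s.io/client-go/tools/cache), where a resync re-fires UpdateFunc
about once per resyncPeriod even when nothing changed on the API
server:

  informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
          AddFunc:    func(obj interface{}) { queue.Add(workKey) },
          UpdateFunc: func(old, new interface{}) { queue.Add(workKey) },
          DeleteFunc: func(obj interface{}) { queue.Add(workKey) },
  })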

My goal is to set the stage for faster cache-warming after receiving a
batch of new PromQL update risks, as described in 530a509
(pkg/cvo/availableupdates: Prioritize conditional risks for largest
target version, 2023-03-06, openshift#909).  I still have not adjusted the
caching logic, so at the moment, it only gives us faster updates on
the "that PromQL is still throttled" loop.  The AddAfter avoids
hot-looping on:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 right now.

and instead gives us:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 around a second from now.
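
In code, that change is roughly (hedged; the key name is a stand-in):

  queue.Add(availableUpdatesKey)                   // before: hot loop
  queue.AddAfter(availableUpdatesKey, time.Second) // after: ~1s cool-off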

The new addUpdate avoids injecting the same Recommended=True target
into availableUpdates multiple times while trying to evaluate another
conditional update, now that we have the !needFreshFetch case, where
we recycle the previous structure's data without a fresh Cincinnati
fetch to clear earlier additions.  Without the addUpdate pivot, we get
[5]:

  status:
    availableUpdates:
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    ...
    conditionalUpdates:
    - conditions:
      - lastTransitionTime: "2023-09-21T09:29:30Z"
        message: The update is recommended, because none of the conditional update
          risks apply to this cluster.
        reason: AsExpected
        status: "True"
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
        version: 4.15.0-0.nightly-2023-10-19-222222
      risks:
      - matchingRules:
        - promql:
            promql: |-
              cluster_infrastructure_provider{type=~"nonexist"}
              or
              0 * cluster_infrastructure_provider
          type: PromQL
        message: Clusters on nonexist provider, this imaginary bug can happen.
        name: SomeInfrastructureThing
        url: https://bug.example.com/c
     ...
    - conditions:
      - lastTransitionTime: "2023-09-21T09:29:31Z"
        message: |-
          On clusters on default invoker user, this imaginary bug can happen. https://bug.example.com/a

          Could not evaluate exposure to update risk SomeChannelThing (evaluation is throttled until 09:29:32Z)
            SomeChannelThing description: On clusters with the channel set to 'buggy', this imaginary bug can happen.
            SomeChannelThing URL: https://bug.example.com/b
        reason: MultipleReasons
        status: "False"
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:66c753e8b75d172f2a3f7ba13363383a76ecbc7ecdc00f3a423bef4ea8560405
        version: 4.15.0-0.nightly-2023-10-17-000000
      risks:
      - matchingRules:
        - promql:
            promql: cluster_installer
          type: PromQL
        message: On clusters on default invoker user, this imaginary bug can happen.
        name: SomeInvokerThing
        url: https://bug.example.com/a
      - matchingRules:
        - promql:
            promql: |-
              group(cluster_version_available_updates{channel="buggy"})
              or
              0 * group(cluster_version_available_updates{channel!="buggy"})
          type: PromQL
        message: On clusters with the channel set to 'buggy', this imaginary bug can
          happen.
        name: SomeChannelThing
        url: https://bug.example.com/b

lasting until the next Cincinnati fetch cleared out the
availableUpdates redundancy.
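
A minimal sketch of that addUpdate dedupe, assuming the receiver and
field names (only the configv1.Release shape is the real OpenShift
API):

  func (u *availableUpdates) addUpdate(release configv1.Release) {
          for _, existing := range u.Updates {
                  if existing.Image == release.Image {
                          return // already injected; skip the duplicate
                  }
          }
          u.Updates = append(u.Updates, release)
  }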

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472
[2]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#UntilWithContext
[3]: https://pkg.go.dev/k8s.io/client-go/util/workqueue#DefaultControllerRateLimiter
[4]: https://github.com/kubernetes/client-go/blob/v0.28.2/util/workqueue/default_rate_limiters.go#L39
[5]: openshift#939 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 25, 2023
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Sep 27, 2023
@openshift-merge-robot (Contributor)

Fix included in accepted release 4.15.0-0.nightly-2023-09-27-073353

wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 25, 2023
Instead of waiting for the next round of evaluation, which might take
minutes.  For example, in 4.14.0-rc.1 testing [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep availableupdate
  I0915 18:21:24.184272       1 availableupdates.go:50] First attempt to retrieve available updates
  I0915 18:21:24.845512       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.836566       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.843398       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:54.835464       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.769850       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.784421       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.842269       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.862590       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:24:09.837669       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T18:21:24Z
  I0915 18:24:24.843569       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  I0915 18:25:24.839869       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  ...
  I0915 20:26:07.109093       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:22:23Z
  I0915 20:29:50.769739       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:26:07Z
  I0915 20:33:34.432215       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:29:50Z
  I0915 20:37:18.093261       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:33:34Z

I'm not entirely clear on what the triggers were there, with 3m44s
between those final entries.  Operator.Run sets up:

  wait.UntilWithContext(runContext, func(runContext context.Context) {
    optr.worker(runContext, optr.availableUpdatesQueue, optr.availableUpdatesSync)
  }, time.Second)

and [2] docs UntilWithContext:

  UntilWithContext loops until context is done, running f every period.

  UntilWithContext is syntactic sugar on top of JitterUntilWithContext
  with zero jitter factor and with sliding = true (which means the
  timer for period starts after the f completes).

So that should be waking up, draining the queue, sleeping a second,
waking back up, draining the queue again, and on forever.  Perhaps we
are just backing off to the slowest DefaultControllerRateLimiter
period [3], but I expect processNextWorkItem's calling handleErr is
calling Forget on the queue, because I don't see any of its
error-branch logging:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep 'Error handling\|out of the queue'
  ...no hits...

That suggests nothing is slowing down our queue processing from
once-per-second (plus evaluation time).  But what's feeding the queue
items to process?  There only Add calls seem to be in
clusterVersionEventHandler, but checking audit logs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h clusterversion kube-apiserver/*audit*.log.gz | jq -r 'select(.verb != "get" and .verb != "list" and .verb != "watch") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .verb + " " + .objectRef.subresource + " " + .user.username' | sort
  ...
  2023-09-15T18:26:24.841812Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:26:24.858507Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:29:39.835307Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:37:39.836698Z 200 update status system:serviceaccount:openshift-cluster-version:default

which are all hours before those 20:26 and similar update retrievals.
I suspect this is due to resyncPeriod(o.ResyncInterval) being passed to
NewFilteredSharedInformerFactory when generating the ClusterVersion
informer, putting a lower bound on the interval between UpdateFunc
event-handler calls when the ClusterVersion object is not changing.
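
For a rough illustration of that mechanism, here is a sketch using a
generic client-go ConfigMap informer as a stand-in for the generated
ClusterVersion one; the namespace, queue key, and function names are
illustrative:

  package sketch

  import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/util/workqueue"
  )

  // watchWithResync wires an informer whose UpdateFunc fires for every
  // cached object roughly once per resync period even when nothing has
  // changed, so resync-driven queue Adds arrive minutes apart rather
  // than every second.
  func watchWithResync(client kubernetes.Interface, queue workqueue.Interface, resync time.Duration) {
    factory := informers.NewFilteredSharedInformerFactory(client, resync, "openshift-cluster-version", nil)
    factory.Core().V1().ConfigMaps().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
      UpdateFunc: func(oldObj, newObj interface{}) {
        queue.Add("cluster")
      },
    })
  }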

My goal is to set the stage for faster cache-warming after receiving a
batch of new PromQL update risks, as described in 530a509
(pkg/cvo/availableupdates: Prioritize conditional risks for largest
target version, 2023-03-06, openshift#909).  I still have not adjusted
the caching logic, so at the moment this change only gives us faster
updates on the "that PromQL is still throttled" loop.  The AddAfter avoids
hot-looping on:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 right now.

and instead gives us:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 around a second from now, as sketched below.
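
A minimal sketch of that requeue, assuming the usual delaying-queue
wiring (the function and parameter names are illustrative, not the
CVO's exact signatures):

  package sketch

  import (
    "context"
    "time"

    "k8s.io/client-go/util/workqueue"
  )

  // sync requeues the key about a second out when risk evaluation is
  // still throttled, instead of re-Adding immediately and hot-looping.
  func sync(ctx context.Context, queue workqueue.DelayingInterface, key string, canEvaluate func() bool) {
    if !canEvaluate() {
      queue.AddAfter(key, time.Second)
      return
    }
    // ...evaluate conditional update risks and push results...
  }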

The new addUpdate avoids injecting the same Recommended=True target
into availableUpdates multiple times while trying to evaluate another
conditional update, now that we have the !needFreshFetch case, where
we recycle the previous structure's data without a fresh Cincinnati
fetch to clear out earlier additions.  Without the addUpdate pivot, we
get [5]:

  status:
    availableUpdates:
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    ...
    conditionalUpdates:
    - conditions:
      - lastTransitionTime: "2023-09-21T09:29:30Z"
        message: The update is recommended, because none of the conditional update
          risks apply to this cluster.
        reason: AsExpected
        status: "True"
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
        version: 4.15.0-0.nightly-2023-10-19-222222
      risks:
      - matchingRules:
        - promql:
            promql: |-
              cluster_infrastructure_provider{type=~"nonexist"}
              or
              0 * cluster_infrastructure_provider
          type: PromQL
        message: Clusters on nonexist provider, this imaginary bug can happen.
        name: SomeInfrastructureThing
        url: https://bug.example.com/c
     ...
    - conditions:
      - lastTransitionTime: "2023-09-21T09:29:31Z"
        message: |-
          On clusters on default invoker user, this imaginary bug can happen. https://bug.example.com/a

          Could not evaluate exposure to update risk SomeChannelThing (evaluation is throttled until 09:29:32Z)
            SomeChannelThing description: On clusters with the channel set to 'buggy', this imaginary bug can happen.
            SomeChannelThing URL: https://bug.example.com/b
        reason: MultipleReasons
        status: "False"
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:66c753e8b75d172f2a3f7ba13363383a76ecbc7ecdc00f3a423bef4ea8560405
        version: 4.15.0-0.nightly-2023-10-17-000000
      risks:
      - matchingRules:
        - promql:
            promql: cluster_installer
          type: PromQL
        message: On clusters on default invoker user, this imaginary bug can happen.
        name: SomeInvokerThing
        url: https://bug.example.com/a
      - matchingRules:
        - promql:
            promql: |-
              group(cluster_version_available_updates{channel="buggy"})
              or
              0 * group(cluster_version_available_updates{channel!="buggy"})
          type: PromQL
        message: On clusters with the channel set to 'buggy', this imaginary bug can
          happen.
        name: SomeChannelThing
        url: https://bug.example.com/b

lasting until the next Cincinnati fetch cleared out the
availableUpdates redundancy.
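
The dedup itself can be as small as the following sketch; the type and
field names are assumptions for illustration, not the CVO's actual
ones:

  package sketch

  // Release is a stand-in for the payload identity we dedupe on.
  type Release struct {
    Image   string
    Version string
  }

  // addUpdate appends release only if no entry with the same image is
  // already present, so re-evaluating one conditional update cannot
  // inject the same Recommended=True target into availableUpdates twice.
  func addUpdate(updates []Release, release Release) []Release {
    for _, u := range updates {
      if u.Image == release.Image {
        return updates
      }
    }
    return append(updates, release)
  }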

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472
[2]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#UntilWithContext
[3]: https://pkg.go.dev/k8s.io/client-go/util/workqueue#DefaultControllerRateLimiter
[4]: https://github.com/kubernetes/client-go/blob/v0.28.2/util/workqueue/default_rate_limiters.go#L39
[5]: openshift#939 (comment)
wking added a commit to wking/openshift-enhancements that referenced this pull request Oct 30, 2023
…nt PromQL

Catching up with openshift/cluster-version-operator@518b44612d
(pkg/clusterconditions/promql: Reduce MinBetweenMatches to 1s,
2023-09-18, openshift/cluster-version-operator#939).
wking added a commit to wking/cluster-version-operator that referenced this pull request Nov 2, 2023
wking added a commit to wking/cluster-version-operator that referenced this pull request Dec 20, 2023
965bfb2 (pkg/cvo/availableupdates: Requeue risk evaluation on
failure, 2023-09-18, openshift#939) pivoted from "every syncAvailableUpdates
round that does anything useful has a fresh Cincinnati pull" to "some
syncAvailableUpdates rounds have a fresh Cincinnati pull, but others
just re-eval some Recommended=Unknown conditional updates".  Then
syncAvailableUpdates calls setAvailableUpdates.

However, until this commit, setAvailableUpdates had been bumping
LastAttempt every time, even in the "just re-eval some conditional
updates" case.  That meant we never tripped the:

        } else if !optrAvailableUpdates.RecentlyChanged(optr.minimumUpdateCheckInterval) {
                klog.V(2).Infof("Retrieving available updates again, because more than %s has elapsed since %s", optr.minimumUpdateCheckInterval, optrAvailableUpdates.LastAttempt.Format(time.RFC3339))

condition to trigger a fresh Cincinnati pull, which could lead to
deadlocks like:

1. Cincinnati serves vulnerable PromQL, like [1].
2. Clusters pick up that broken PromQL, try to evaluate, and fail.
   Re-eval-and-fail loop continues.
3. Cincinnati PromQL fixed, like [2].
4. Cases:
   a. Before 965bfb2, and also after this commit, clusters pick up
      the fixed PromQL, try to evaluate, and start succeeding.  Hooray!
   b. Clusters with 965bfb2 but without this commit say "it's been
      a long time since we pulled fresh Cincinnati information, but it
      has not been long since my last attempt to eval this broken
      PromQL, so let me skip the Cincinnati pull and re-eval that old
      PromQL", which fails.  Re-eval-and-fail loop continues.

To break out of 4.b, clusters on impacted releases can roll their CVO
pod:

  $ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pod

which will clear out LastAttempt and trigger a fresh Cincinnati pull.
I'm not sure if there's another recovery method...
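
For illustration, a sketch of the guard and the fix, with assumed
field and method names:

  package sketch

  import "time"

  type availableUpdates struct {
    LastAttempt time.Time
  }

  // RecentlyChanged reports whether the last attempt happened within
  // interval; while it stays true, syncAvailableUpdates skips the fresh
  // Cincinnati pull.
  func (u *availableUpdates) RecentlyChanged(interval time.Duration) bool {
    return u.LastAttempt.After(time.Now().Add(-interval))
  }

  // recordAttempt captures the fix: bump LastAttempt only on rounds
  // that made a fresh Cincinnati fetch, so re-eval-only rounds no
  // longer push the next real pull further into the future.
  func (u *availableUpdates) recordAttempt(freshFetch bool) {
    if freshFetch {
      u.LastAttempt = time.Now()
    }
  }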

[1]: openshift/cincinnati-graph-data#4524
[2]: openshift/cincinnati-graph-data#4528
wking added a commit to wking/cluster-version-operator that referenced this pull request Dec 20, 2023
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jan 2, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jan 4, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jan 12, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jan 20, 2024