OCPBUGS-19512: pkg/clusterconditions/promql: Warm cache with 1s delay #939
Conversation
@wking: This pull request references OTA-942 which is a valid jira issue.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Skipping CI for Draft Pull Request.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close.
/lifecycle stale
Force-pushed from fbff2b4 to 873254c
@wking: This pull request references OTA-942 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 2e366cf to 294e0cf
/remove-lifecycle stale
Force-pushed from c971ce7 to ae43804
Testing with Cluster Bot:
$ oc get -o json clusterversion version | jq .status.desired
{
"image": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:ebc665c8f2d59b1eefb6fb802228743d746de42c2f73214b63ae61e0e2d0e20a",
"version": "4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest"
}
Set up dummy graph-data riffing on OTA-520:
$ git checkout --orphan demo-cache-warming
$ git rm -rf . # we don't want any of the previous branch's files in this branch
$ cat <<EOF >cincinnati-graph.json
> {
> "nodes": [
> {"version": "4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest", "payload": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:ebc665c8f2d59b1eefb6fb802228743d746de42c2f73214b63ae61e0e2d0e20a"},
> {"version": "4.15.1", "payload": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"}
> ],
> "conditionalEdges": []
> }
> EOF
$ jq . cincinnati-graph.json # sanity check your syntax as valid JSON to catch missing commas and such
$ git add cincinnati-graph.json
$ git commit -m 'WIP: Static Cincinnati graph to demo cache warming; first with no edges'
$ git push -u wking demo-cache-warming
Configure my cluster to consume that graph:
$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo-cache-warming/cincinnati-graph.json"}]'
$ oc adm upgrade channel whatever
Check that it's working:
$ oc adm upgrade
Cluster version is 4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest
Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo-cache-warming/cincinnati-graph.json
Channel: whatever
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
Add a bunch of risks:
$ cat <<EOF >cincinnati-graph.json
> {
> "nodes": [
> {"version": "4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest", "payload": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:ebc665c8f2d59b1eefb6fb802228743d746de42c2f73214b63ae61e0e2d0e20a"},
> {"version": "4.15.1", "payload": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"}
> ],
> "conditionalEdges": [
> {
> "edges": [{"from": "4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest", "to": "4.15.1"}],
> "risks": [
> {"url": "https://bug.example.com/a", "name": "A", "message": "A.", "matchingRules": [{"type": "PromQL", "promql": {"promql": "max(cluster_proxy_enabled{type=~\"https?\"})"}}]},
> {"url": "https://bug.example.com/b", "name": "B", "message": "B.", "matchingRules": [{"type": "PromQL", "promql": {"promql": "group(cluster_version_available_updates{channel=\"buggy\"})\nor\n0 * group(cluster_version_available_updates{channel!=\"buggy\"})"}}]},
> {"url": "https://bug.example.com/c", "name": "C", "message": "C.", "matchingRules": [{"type": "PromQL", "promql": {"promql": "group(csv_succeeded{name=~\"local-storage-operator[.].*\"}) or 0 * group(csv_count)"}}]}
> ]
> }
> ]
> }
> EOF
$ jq . cincinnati-graph.json # sanity check your syntax as valid JSON to catch missing commas and such
$ git add cincinnati-graph.json
$ git commit -m 'WIP: Static Cincinnati graph to demo cache warming; lots of risks'
$ git push wking demo-cache-warming
The risks get evaluated:
$ oc adm upgrade
Cluster version is 4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest
Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo-cache-warming/cincinnati-graph.json
Channel: whatever
Recommended updates:
VERSION IMAGE
4.15.1 registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Checking on timing:
$ oc get -o json clusterversion version | jq -r '.status.conditionalUpdates[0].conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2023-09-19T20:51:27Z Recommended=True AsExpected: The update is recommended, because none of the conditional update risks apply to this cluster.
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep availableupdates.go
...
I0919 20:46:54.937033 1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T20:42:24Z
I0919 20:49:10.042854 1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T20:46:54Z, will try later.
I0919 20:51:25.148521 1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T20:46:54Z
I0919 20:51:25.325465 1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk B (evaluation is throttled until 20:51:26Z because the query has never been evaluated)
I0919 20:51:25.325975 1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk B (evaluation is throttled until 20:51:26Z because the query has never been evaluated)
...
I0919 20:51:26.324938 1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk B (evaluation is throttled until 20:51:26Z because the query has never been evaluated)
I0919 20:51:26.352599 1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk C (evaluation is throttled until 20:51:27Z because the query has never been evaluated)
I0919 20:51:26.353104 1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk C (evaluation is throttled until 20:51:27Z because the query has never been evaluated)
I0919 20:51:26.353594 1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk C (evaluation is throttled until 20:51:27Z because the query has never been evaluated)
...
I0919 20:51:27.352109 1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk C (evaluation is throttled until 20:51:27Z because the query has never been evaluated)
I0919 20:51:27.747240 1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T20:51:27Z, will try later.
I0919 20:53:40.253893 1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T20:51:27Z, will try later.
I0919 20:55:55.361345 1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T20:51:27Z
I0919 20:58:10.467079 1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T20:55:55Z, will try later.
I0919 21:00:25.573629 1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T20:55:55Z
I0919 21:02:40.680171 1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T21:00:25Z, will try later.
I0919 21:04:55.785728 1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T21:00:25Z
I0919 21:07:10.892253 1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T21:04:55Z, will try later.
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep 'evaluate PromQL cluster condition'
I0919 20:51:25.295589 1 promql.go:119] evaluate PromQL cluster condition: "max(cluster_proxy_enabled{type=~\"https?\"})"
I0919 20:51:26.334624 1 promql.go:119] evaluate PromQL cluster condition: "group(cluster_version_available_updates{channel=\"buggy\"})\nor\n0 * group(cluster_version_available_updates{channel!=\"buggy\"})"
I0919 20:51:27.356257 1 promql.go:119] evaluate PromQL cluster condition: "group(csv_succeeded{name=~\"local-storage-operator[.].*\"}) or 0 * group(csv_count)"
So that is warming all three risks from 20:51:25 to 20:51:27, at the desired one PromQL per second 🎉. It is a bit hot on the internal queue triggers; we can probably use …
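For readers who have not opened pkg/clusterconditions/promql, here is a minimal, hypothetical Go sketch of the throttling idea those log lines exercise: each PromQL expression gets a per-query timestamp, and a repeat evaluation inside the minimum interval (1s after this PR) is refused with a "throttled until" error so the caller requeues instead of hammering the in-cluster monitoring stack. The names (queryCache, minBetweenMatches, evaluate) are illustrative, not the CVO's actual identifiers, and the real cache also paces the very first evaluation of brand-new queries, as the "has never been evaluated" messages above show.

package main

import (
	"fmt"
	"time"
)

// queryCache is an illustrative per-PromQL-expression throttle, not the
// CVO's real cache type. It refuses to re-evaluate an expression until
// minBetweenMatches has elapsed since the last accepted attempt.
type queryCache struct {
	minBetweenMatches time.Duration
	lastAttempt       map[string]time.Time
}

func newQueryCache(minBetween time.Duration) *queryCache {
	return &queryCache{
		minBetweenMatches: minBetween,
		lastAttempt:       make(map[string]time.Time),
	}
}

// evaluate runs eval for promql unless the expression was attempted too
// recently, in which case it returns a "throttled until" error so the
// caller can requeue and try again shortly.
func (c *queryCache) evaluate(promql string, eval func(string) (bool, error)) (bool, error) {
	now := time.Now()
	if last, ok := c.lastAttempt[promql]; ok && now.Sub(last) < c.minBetweenMatches {
		return false, fmt.Errorf("evaluation is throttled until %s", last.Add(c.minBetweenMatches).UTC().Format("15:04:05Z"))
	}
	c.lastAttempt[promql] = now
	return eval(promql)
}

func main() {
	cache := newQueryCache(time.Second)
	dummy := func(string) (bool, error) { return false, nil }
	for i := 0; i < 3; i++ {
		if _, err := cache.evaluate(`max(cluster_proxy_enabled{type=~"https?"})`, dummy); err != nil {
			fmt.Println(err) // repeat calls within the same second are throttled
		}
		time.Sleep(400 * time.Millisecond)
	}
}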
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: petr-muller, wking
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/test e2e-agnostic-ovn-upgrade-into-change
Sippy says this flake is common now in 4.14
Argh, I actually wanted to /override 😬
I expect this pick to fail on context conflicts, but asking for it anyway to have an unambiguous motivation for the manual backport:
/cherrypick release-4.14
@wking: once the present PR merges, I will cherry-pick it on top of release-4.14 in a new PR and assign it to you.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@wking: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@wking: Jira Issue OCPBUGS-19512: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-19512 has been moved to the MODIFIED state.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@wking: #939 failed to apply on top of branch "release-4.14":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Instead of waiting for the next round of evaluation, which might take minutes. For example, in 4.14.0-rc.1 testing [1]:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep availableupdate
I0915 18:21:24.184272 1 availableupdates.go:50] First attempt to retrieve available updates
I0915 18:21:24.845512 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
I0915 18:21:39.836566 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
I0915 18:21:39.843398 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
I0915 18:21:54.835464 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
I0915 18:23:16.769850 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
I0915 18:23:16.784421 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
I0915 18:23:39.842269 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
I0915 18:23:39.862590 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
I0915 18:24:09.837669 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T18:21:24Z
I0915 18:24:24.843569 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
I0915 18:25:24.839869 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
...
I0915 20:26:07.109093 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:22:23Z
I0915 20:29:50.769739 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:26:07Z
I0915 20:33:34.432215 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:29:50Z
I0915 20:37:18.093261 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:33:34Z
I'm not entirely clear on what the triggers were there, with 3m44s between those final entries. Operator.Run sets up:
wait.UntilWithContext(runContext, func(runContext context.Context) { optr.worker(runContext, optr.availableUpdatesQueue, optr.availableUpdatesSync) }, time.Second)
and [2] docs UntilWithContext:
UntilWithContext loops until context is done, running f every period.
UntilWithContext is syntactic sugar on top of JitterUntilWithContext with zero jitter factor and with sliding = true (which means the timer for period starts after the f completes).
So that should be waking up, draining the queue, sleeping a second, waking back up, draining the queue again, and on forever. Perhaps we are just backing off to the slowest DefaultControllerRateLimiter period [3], but I expect processNextWorkItem's calling handleErr is calling Forget on the queue, because I don't see any of its error-branch logging:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep 'Error handling\|out of the queue'
...no hits...
That suggests nothing is slowing down our queue processing from once-per-second (plus evaluation time). But what's feeding the queue items to process? The only Add calls seem to be in clusterVersionEventHandler, but checking audit logs:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h clusterversion kube-apiserver/*audit*.log.gz | jq -r 'select(.verb != "get" and .verb != "list" and .verb != "watch") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .verb + " " + .objectRef.subresource + " " + .user.username' | sort
...
2023-09-15T18:26:24.841812Z 200 update status system:serviceaccount:openshift-cluster-version:default
2023-09-15T18:26:24.858507Z 200 update status system:serviceaccount:openshift-cluster-version:default
2023-09-15T18:29:39.835307Z 200 update status system:serviceaccount:openshift-cluster-version:default
2023-09-15T18:37:39.836698Z 200 update status system:serviceaccount:openshift-cluster-version:default
which are all hours before these 20:26 and similar update retrievals. I suspect this is due to resyncPeriod(o.ResyncInterval) being passed to NewFilteredSharedInformerFactory when generating the ClusterVersion informer, putting a lower bound on the UpdateFunc event-handler frequency.
My goal is to set the stage for faster cache-warming after receiving a batch of new PromQL update risks, as described in 530a509 (pkg/cvo/availableupdates: Prioritize conditional risks for largest target version, 2023-03-06, openshift#909). I still have not adjusted the caching logic, so at the moment, it only gives us faster updates on the "that PromQL is still throttled" loop. The AddAfter avoids hot-looping on:
1. Can I evaluate the risks?
2. No? Requeue and return to 1 right now.
and instead gives us:
1. Can I evaluate the risks?
2. No? Requeue and return to 1 around a second from now.
The new addUpdate avoids injecting the same Recommended=True target into availableUpdates multiple times while trying to evaluate another conditional update, now that we have the !needFreshFetch case, where we recycle the previous structure data without the fresh Cincinnati fetch to clear earlier additions.
Without the addUpdate pivot, we get [5]:
status:
  availableUpdates:
  - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
    version: 4.15.0-0.nightly-2023-10-19-222222
  - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
    version: 4.15.0-0.nightly-2023-10-19-222222
  - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
    version: 4.15.0-0.nightly-2023-10-19-222222
  ...
  conditionalUpdates:
  - conditions:
    - lastTransitionTime: "2023-09-21T09:29:30Z"
      message: The update is recommended, because none of the conditional update risks apply to this cluster.
      reason: AsExpected
      status: "True"
      type: Recommended
    release:
      image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    risks:
    - matchingRules:
      - promql:
          promql: |-
            cluster_infrastructure_provider{type=~"nonexist"}
            or
            0 * cluster_infrastructure_provider
        type: PromQL
      message: Clusters on nonexist provider, this imaginary bug can happen.
      name: SomeInfrastructureThing
      url: https://bug.example.com/c
  ...
  - conditions:
    - lastTransitionTime: "2023-09-21T09:29:31Z"
      message: |-
        On clusters on default invoker user, this imaginary bug can happen. https://bug.example.com/a
        Could not evaluate exposure to update risk SomeChannelThing (evaluation is throttled until 09:29:32Z)
        SomeChannelThing description: On clusters with the channel set to 'buggy', this imaginary bug can happen.
        SomeChannelThing URL: https://bug.example.com/b
      reason: MultipleReasons
      status: "False"
      type: Recommended
    release:
      image: registry.ci.openshift.org/ocp/release@sha256:66c753e8b75d172f2a3f7ba13363383a76ecbc7ecdc00f3a423bef4ea8560405
      version: 4.15.0-0.nightly-2023-10-17-000000
    risks:
    - matchingRules:
      - promql:
          promql: cluster_installer
        type: PromQL
      message: On clusters on default invoker user, this imaginary bug can happen.
      name: SomeInvokerThing
      url: https://bug.example.com/a
    - matchingRules:
      - promql:
          promql: |-
            group(cluster_version_available_updates{channel="buggy"})
            or
            0 * group(cluster_version_available_updates{channel!="buggy"})
        type: PromQL
      message: On clusters with the channel set to 'buggy', this imaginary bug can happen.
      name: SomeChannelThing
      url: https://bug.example.com/b
lasting until the next Cincinnati fetch cleared out the availableUpdates redundancy.
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472
[2]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#UntilWithContext
[3]: https://pkg.go.dev/k8s.io/client-go/util/workqueue#DefaultControllerRateLimiter
[4]: https://github.com/kubernetes/client-go/blob/v0.28.2/util/workqueue/default_rate_limiters.go#L39
[5]: openshift#939 (comment)
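As a rough illustration of the requeue pattern this commit message describes (a sketch, not the CVO's actual sync loop), a worker that finds risk evaluation still throttled can hand the key back with the workqueue's AddAfter instead of Add, so the retry lands about a second later rather than immediately. evaluateRisks and the "cluster-version" key are illustrative names:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// evaluateRisks stands in for a sync function that may report that some
// PromQL risk evaluations are still throttled; it is illustrative only.
func evaluateRisks() (stillThrottled bool) {
	// ...query the PromQL cache, update ClusterVersion status, etc....
	return true
}

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	queue.Add("cluster-version")

	for i := 0; i < 3; i++ {
		key, quit := queue.Get()
		if quit {
			return
		}
		if evaluateRisks() {
			// Requeue roughly a second out instead of immediately, so the
			// loop warms one PromQL query per second rather than hot-looping.
			queue.AddAfter(key, time.Second)
			fmt.Println("requeued", key, "for ~1s from now")
		}
		queue.Done(key)
	}
}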
Fix included in accepted release 4.15.0-0.nightly-2023-09-27-073353
…nt PromQL Catching up with openshift/cluster-version-operator@518b44612d (pkg/clusterconditions/promql: Reduce MinBetweenMatches to 1s, 2023-09-18, openshift/cluster-version-operator#939).
Instead of waiting for the next round of evaluation, which might take minutes. For example, in 4.14.0-rc.1 testing [1]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep availableupdate I0915 18:21:24.184272 1 availableupdates.go:50] First attempt to retrieve available updates I0915 18:21:24.845512 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later. I0915 18:21:39.836566 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later. I0915 18:21:39.843398 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later. I0915 18:21:54.835464 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later. I0915 18:23:16.769850 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later. I0915 18:23:16.784421 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later. I0915 18:23:39.842269 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later. I0915 18:23:39.862590 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later. I0915 18:24:09.837669 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T18:21:24Z I0915 18:24:24.843569 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later. I0915 18:25:24.839869 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later. ... I0915 20:26:07.109093 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:22:23Z I0915 20:29:50.769739 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:26:07Z I0915 20:33:34.432215 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:29:50Z I0915 20:37:18.093261 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:33:34Z I'm not entirely clear on what the triggers were there, with 3m44s between those final entries. Operator.Run sets up: wait.UntilWithContext(runContext, func(runContext context.Context) { optr.worker(runContext, optr.availableUpdatesQueue, optr.availableUpdatesSync) }, time.Second) and [2] docs UntilWithContext: UntilWithContext loops until context is done, running f every period. 
UntilWithContext is syntactic sugar on top of JitterUntilWithContext with zero jitter factor and with sliding = true (which means the timer for period starts after the f completes). So that should be waking up, draining the queue, sleeping a second, waking back up, draining the queue again, and on forever. Perhaps we are just backing off to the slowest DefaultControllerRateLimiter period [3], but I expect processNextWorkItem's calling handleErr is calling Forget on the queue, because I don't see any of its error-branch logging: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep 'Error handling\|out of the queue' ...no hits... That suggests nothing is slowing down our queue processing from once-per-second (plus evaluation time). But what's feeding the queue items to process? There only Add calls seem to be in clusterVersionEventHandler, but checking audit logs: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2 $ zgrep -h clusterversion kube-apiserver/*audit*.log.gz | jq -r 'select(.verb != "get" and .verb != "list" and .verb != "watch") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .verb + " " + .objectRef.subresource + " " + .user.username' | sort ... 2023-09-15T18:26:24.841812Z 200 update status system:serviceaccount:openshift-cluster-version:default 2023-09-15T18:26:24.858507Z 200 update status system:serviceaccount:openshift-cluster-version:default 2023-09-15T18:29:39.835307Z 200 update status system:serviceaccount:openshift-cluster-version:default 2023-09-15T18:37:39.836698Z 200 update status system:serviceaccount:openshift-cluster-version:default which are all hours before these 20:26 and similar update retrievals. I suspect this due to resyncPeriod(o.ResyncInterval) being passed to NewFilteredSharedInformerFactory when generating the ClusterVersion informer, putting a lower bound on the UpdateFunc event-handler frequency. My goal is to set the stage for faster cache-warming after receiving a batch of new PromQL update risks, as described in 530a509 (pkg/cvo/availableupdates: Prioritize conditional risks for largest target version, 2023-03-06, openshift#909). I still have not adjusted the caching logic, so at the moment, it only gives us faster updates on the "that PromQL is still throttled" loop. The AddAfter avoids hot-looping on: 1. Can I evaluate the risks? 2. No? Requeue and return to 1 right now. and instead gives us: 1. Can I evaluate the risks? 2. No? Requeue and return to 1 around a second from know. The new addUpdate avoids injecting the same Recommended=True target into availableUpdates multiple times while trying to evaluate another conditional update, now that we have the !needFreshFetch case, where we recycle the previous structure data without the fresh Cincinnati fetch to clear earlier additions. 
Without the addUpdate pivot, we get [5]:

status:
  availableUpdates:
  - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
    version: 4.15.0-0.nightly-2023-10-19-222222
  - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
    version: 4.15.0-0.nightly-2023-10-19-222222
  - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
    version: 4.15.0-0.nightly-2023-10-19-222222
  ...
  conditionalUpdates:
  - conditions:
    - lastTransitionTime: "2023-09-21T09:29:30Z"
      message: The update is recommended, because none of the conditional update risks apply to this cluster.
      reason: AsExpected
      status: "True"
      type: Recommended
    release:
      image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    risks:
    - matchingRules:
      - promql:
          promql: |-
            cluster_infrastructure_provider{type=~"nonexist"}
            or
            0 * cluster_infrastructure_provider
        type: PromQL
      message: Clusters on nonexist provider, this imaginary bug can happen.
      name: SomeInfrastructureThing
      url: https://bug.example.com/c
  ...
  - conditions:
    - lastTransitionTime: "2023-09-21T09:29:31Z"
      message: |-
        On clusters on default invoker user, this imaginary bug can happen. https://bug.example.com/a

        Could not evaluate exposure to update risk SomeChannelThing (evaluation is throttled until 09:29:32Z)
          SomeChannelThing description: On clusters with the channel set to 'buggy', this imaginary bug can happen.
          SomeChannelThing URL: https://bug.example.com/b
      reason: MultipleReasons
      status: "False"
      type: Recommended
    release:
      image: registry.ci.openshift.org/ocp/release@sha256:66c753e8b75d172f2a3f7ba13363383a76ecbc7ecdc00f3a423bef4ea8560405
      version: 4.15.0-0.nightly-2023-10-17-000000
    risks:
    - matchingRules:
      - promql:
          promql: cluster_installer
        type: PromQL
      message: On clusters on default invoker user, this imaginary bug can happen.
      name: SomeInvokerThing
      url: https://bug.example.com/a
    - matchingRules:
      - promql:
          promql: |-
            group(cluster_version_available_updates{channel="buggy"})
            or
            0 * group(cluster_version_available_updates{channel!="buggy"})
        type: PromQL
      message: On clusters with the channel set to 'buggy', this imaginary bug can happen.
      name: SomeChannelThing
      url: https://bug.example.com/b

lasting until the next Cincinnati fetch cleared out the availableUpdates redundancy.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472
[2]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#UntilWithContext
[3]: https://pkg.go.dev/k8s.io/client-go/util/workqueue#DefaultControllerRateLimiter
[4]: https://github.com/kubernetes/client-go/blob/v0.28.2/util/workqueue/default_rate_limiters.go#L39
[5]: openshift#939 (comment)
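For reference, the requeue-and-return-in-about-a-second flow described above boils down to something like this hedged Go sketch, using the stock workqueue API. The evaluateRisks helper and the key value are illustrative stand-ins, not the CVO's actual code:

package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// evaluateRisks stands in for conditional-update risk evaluation; here it
// always reports that PromQL evaluation is still throttled.
func evaluateRisks() error {
	return errors.New("evaluation is throttled")
}

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	const key = "cluster-version"

	if err := evaluateRisks(); err != nil {
		// Requeue roughly a second out instead of immediately, so the worker
		// does not hot-loop on "that PromQL is still throttled".
		fmt.Printf("requeueing %s after 1s: %v\n", key, err)
		queue.AddAfter(key, time.Second)
	} else {
		queue.Forget(key) // success: clear any accumulated backoff
	}
}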
965bfb2 (pkg/cvo/availableupdates: Requeue risk evaluation on failure, 2023-09-18, openshift#939) pivoted from "every syncAvailableUpdates round that does anything useful has a fresh Cincinnati pull" to "some syncAvailableUpdates rounds have a fresh Cincinnati pull, but others just re-eval some Recommended=Unknown conditional updates". Then syncAvailableUpdates calls setAvailableUpdates. However, until this commit, setAvailableUpdates had been bumping LastAttempt every time, even in the "just-re-eval conditional updates" case. That meant we never tripped the:

} else if !optrAvailableUpdates.RecentlyChanged(optr.minimumUpdateCheckInterval) {
	klog.V(2).Infof("Retrieving available updates again, because more than %s has elapsed since %s", optr.minimumUpdateCheckInterval, optrAvailableUpdates.LastAttempt.Format(time.RFC3339))

condition to trigger a fresh Cincinnati pull. That could lead to deadlocks like:

1. Cincinnati serves vulnerable PromQL, like [1].
2. Clusters pick up that broken PromQL, try to evaluate, and fail. Re-eval-and-fail loop continues.
3. Cincinnati PromQL fixed, like [2].
4. Cases:
   a. Before 965bfb2, and also after this commit, clusters pick up the fixed PromQL, try to evaluate, and start succeeding. Hooray!
   b. Clusters with 965bfb2 but without this commit say "it's been a long time since we pulled fresh Cincinnati information, but it has not been long since my last attempt to eval this broken PromQL, so let me skip the Cincinnati pull and re-eval that old PromQL", which fails. Re-eval-and-fail loop continues.

To break out of 4.b, clusters on impacted releases can roll their CVO pod:

$ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pod

which will clear out LastAttempt and trigger a fresh Cincinnati pull. I'm not sure if there's another recovery method...

[1]: openshift/cincinnati-graph-data#4524
[2]: openshift/cincinnati-graph-data#4528
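To show the shape of the fix, here is a hedged Go sketch of "only bump LastAttempt on rounds that actually attempt a Cincinnati fetch". The availableUpdates fields and the RecentlyChanged body are illustrative approximations of the CVO's bookkeeping, not its actual implementation:

package main

import (
	"fmt"
	"time"
)

type availableUpdates struct {
	LastAttempt time.Time
}

// RecentlyChanged reports whether the last fetch attempt happened within interval.
func (a *availableUpdates) RecentlyChanged(interval time.Duration) bool {
	return a.LastAttempt.After(time.Now().Add(-interval))
}

func syncAvailableUpdates(a *availableUpdates, minimumUpdateCheckInterval time.Duration) {
	if !a.RecentlyChanged(minimumUpdateCheckInterval) {
		fmt.Println("fetching fresh Cincinnati data")
		a.LastAttempt = time.Now() // only a real fetch attempt resets the clock
		return
	}
	// Re-eval-only round: deliberately do NOT touch LastAttempt, otherwise a
	// risk whose PromQL keeps failing would postpone the next Cincinnati pull
	// indefinitely (the 4.b deadlock above).
	fmt.Println("re-evaluating cached conditional updates only")
}

func main() {
	a := &availableUpdates{}
	syncAvailableUpdates(a, 2*time.Minute) // fresh fetch
	syncAvailableUpdates(a, 2*time.Minute) // re-eval only
}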
530a509 (#909) prioritized the order in which risks were evaluated. But we were still waiting 10 minutes between different PromQL evaluations while evaluating conditional update risks. This pull request sets up faster initial cache-warming while maintaining the same long-term load on the backing match engine, so it doesn't take 10 minutes to get that initial "I've figured it out now!" result.
I'm preserving the 10-minute throttling between match calls when the cache is already warm. But while the cache is warming up, we allow a match call every second, for much faster cache warming. [edit: we dropped the warming-vs-resyncing distinction]
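To make the originally intended policy concrete (the edit above notes the warming-vs-resyncing distinction was later dropped), here is a minimal Go sketch; the promqlCache type, its fields, and allowMatch are illustrative, not the pkg/clusterconditions/promql package's actual implementation:

package main

import (
	"fmt"
	"time"
)

type promqlCache struct {
	lastMatch time.Time
	warmed    bool
}

// minDelay is the minimum spacing between match calls for one PromQL expression.
func (c *promqlCache) minDelay() time.Duration {
	if c.warmed {
		return 10 * time.Minute // steady-state load on the monitoring stack
	}
	return time.Second // much faster initial cache warming
}

// allowMatch reports whether enough time has passed since the previous match call.
func (c *promqlCache) allowMatch(now time.Time) bool {
	return now.Sub(c.lastMatch) >= c.minDelay()
}

func main() {
	c := &promqlCache{lastMatch: time.Now().Add(-2 * time.Second)}
	fmt.Println(c.allowMatch(time.Now())) // true: still warming, over 1s has passed
	c.warmed = true
	fmt.Println(c.allowMatch(time.Now())) // false: a warm cache throttles to 10m
}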