
Conversation

@wking (Member) commented Jun 7, 2023

530a509 (#909) prioritized the order in which risks were evaluated. But we were still waiting 10 minutes between different PromQL evaluations while evaluating conditional update risks. This pull request sets up faster initial cache-warming while maintaining the same long-term load on the backing match engine, so it doesn't take 10 minutes to get that initial "I've figured it out now!" result. I'm preserving 10-minute throttling between match calls when the cache is already warm. But when the cache is warming up, we allow a match call every second, for much faster cache warming. [edit: we dropped the warming-vs-resyncing distinction]
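
To make that throttle split concrete, here is a minimal Go sketch of the behavior as originally described (again, the merged code dropped the warming-vs-resyncing distinction); the MinBetweenMatches* names echo this description, while the surrounding Cache shape is an assumption for illustration:

package promqlthrottle

import "time"

// Cache is an illustrative stand-in for the PromQL condition cache.
type Cache struct {
    MinBetweenMatches           time.Duration // ~1s spacing while the cache is warming
    MinBetweenMatchesWhenCached time.Duration // ~10m per-query throttle once warm

    lastMatch     time.Time            // when any query last hit the match engine
    lastEvaluated map[string]time.Time // per-query last evaluation time
}

// throttledUntil reports when a query may next hit the backing match
// engine: never-evaluated queries wait only for the short warming
// spacing, while already-cached queries keep the long throttle, so the
// steady-state load on the engine is unchanged.
func (c *Cache) throttledUntil(query string) time.Time {
    if last, ok := c.lastEvaluated[query]; ok {
        return last.Add(c.MinBetweenMatchesWhenCached)
    }
    return c.lastMatch.Add(c.MinBetweenMatches)
}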

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 7, 2023
@openshift-ci-robot (Contributor) commented Jun 7, 2023

@wking: This pull request references OTA-942 which is a valid jira issue.


In response to this:

530a509 (#909) prioritized the order in which risks were evaluated. But we were still waiting 10 minutes between different PromQL evaluations while evaluating conditional update risks. This commit sets up callbacks so that when we get throttled, the condition engine can come back later and let the available-update goroutine know an evaluation is available, without having to wait for the available-update goroutine to come back around with another ~minutes-delayed polling loop.

I'm also adding MinBetweenMatchesWhenCached, allowing faster initial cache-warming while maintaining the same long-term load on the backing match engine, so it doesn't take 10 minutes to get that initial "I've figured it out now!" callback. I'm preserving 10-minute throttling between match calls when the cache is already warm. But when the cache is warming up, we allow a match call every second, for much faster cache warming.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 7, 2023
@openshift-ci bot (Contributor) commented Jun 7, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 7, 2023
@openshift-bot (Contributor)

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 6, 2023
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 6, 2023
@wking wking force-pushed the fast-fill-promql-cache branch from fbff2b4 to 873254c on September 19, 2023 05:43
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 19, 2023
@wking wking changed the title OTA-942: pkg/clusterconditions: Include callbacks for initially-failing evaluation OTA-942: pkg/clusterconditions/promql: Warm cache with 1s delay Sep 19, 2023
@wking wking marked this pull request as ready for review September 19, 2023 05:44
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 19, 2023
@openshift-ci-robot (Contributor) commented Sep 19, 2023

@wking: This pull request references OTA-942 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.


In response to this:

530a509 (#909) prioritized the order in which risks were evaluated. But we were still waiting 10 minutes between different PromQL evaluations while evaluating conditional update risks. This pull request sets up faster initial cache-warming while maintaining the same long-term load on the backing match engine, so it doesn't take 10 minutes to get that initial "I've figured it out now!" result. I'm preserving 10-minute throttling between match calls when the cache is already warm. But when the cache is warming up, we allow a match call every second, for much faster cache warming.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking force-pushed the fast-fill-promql-cache branch 7 times, most recently from 2e366cf to 294e0cf on September 19, 2023 10:23
@wking (Member, Author) commented Sep 19, 2023

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 19, 2023
@wking wking force-pushed the fast-fill-promql-cache branch 2 times, most recently from c971ce7 to ae43804 on September 19, 2023 10:44
@wking (Member, Author) commented Sep 19, 2023

Testing with Cluster Bot launch 4.15,openshift/cluster-version-operator#939 gcp:

$ oc get -o json clusterversion version | jq .status.desired
{
  "image": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:ebc665c8f2d59b1eefb6fb802228743d746de42c2f73214b63ae61e0e2d0e20a",
  "version": "4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest"
}

Set up dummy graph-data riffing on OTA-520:

$ git checkout --orphan demo-cache-warming
$ git rm -rf .   # we don't want any of the previous branch's files in this branch
$ cat <<EOF >cincinnati-graph.json      
> {
>   "nodes": [
>     {"version": "4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest", "payload": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:ebc665c8f2d59b1eefb6fb802228743d746de42c2f73214b63ae61e0e2d0e20a"},
>     {"version": "4.15.1", "payload": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"}
>   ],
>   "conditionalEdges": []
> }
> EOF
$ jq . cincinnati-graph.json  # sanity check your syntax as valid JSON to catch missing commas and such
$ git add cincinnati-graph.json
$ git commit -m 'WIP: Static Cincinnati graph to demo cache warming; first with no edges'
$ git push -u wking demo-cache-warming

Configure my cluster to consume that graph:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo-cache-warming/cincinnati-graph.json"}]'
$ oc adm upgrade channel whatever

Check that it's working:

$ oc adm upgrade
Cluster version is 4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest

Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo-cache-warming/cincinnati-graph.json
Channel: whatever
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.

Add a bunch of risks:

$ cat <<EOF >cincinnati-graph.json
> {
>   "nodes": [
>     {"version": "4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest", "payload": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:ebc665c8f2d59b1eefb6fb802228743d746de42c2f73214b63ae61e0e2d0e20a"},
>     {"version": "4.15.1", "payload": "registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"}
>   ],
>   "conditionalEdges": [
>     {
>       "edges": [{"from": "4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest", "to": "4.15.1"}],
>       "risks": [
>         {"url": "https://bug.example.com/a", "name": "A", "message": "A.", "matchingRules": [{"type": "PromQL", "promql": {"promql": "max(cluster_proxy_enabled{type=~\"https?\"})"}}]},
>         {"url": "https://bug.example.com/b", "name": "B", "message": "B.", "matchingRules": [{"type": "PromQL", "promql": {"promql": "group(cluster_version_available_updates{channel=\"buggy\"})\nor\n0 * group(cluster_version_available_updates{channel!=\"buggy\"})"}}]},
>         {"url": "https://bug.example.com/c", "name": "C", "message": "C.", "matchingRules": [{"type": "PromQL", "promql": {"promql": "group(csv_succeeded{name=~\"local-storage-operator[.].*\"}) or 0 * group(csv_count)"}}]}
>       ]
>     }
>   ]
> }
> EOF
$ jq . cincinnati-graph.json  # sanity check your syntax as valid JSON to catch missing commas and such
$ git add cincinnati-graph.json
$ git commit -m 'WIP: Static Cincinnati graph to demo cache warming; lots of risks'
$ git push wking demo-cache-warming

The risks get evaluated:

$ oc adm upgrade
Cluster version is 4.15.0-0.test-2023-09-19-193506-ci-ln-qs5v8tt-latest

Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo-cache-warming/cincinnati-graph.json
Channel: whatever

Recommended updates:

  VERSION     IMAGE
  4.15.1      registry.build02.ci.openshift.org/ci-ln-qs5v8tt/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

Checking on timing:

$ oc get -o json clusterversion version | jq -r '.status.conditionalUpdates[0].conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2023-09-19T20:51:27Z Recommended=True AsExpected: The update is recommended, because none of the conditional update risks apply to this cluster.
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep availableupdates.go
...
I0919 20:46:54.937033       1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T20:42:24Z
I0919 20:49:10.042854       1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T20:46:54Z, will try later.
I0919 20:51:25.148521       1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T20:46:54Z
I0919 20:51:25.325465       1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk B (evaluation is throttled until 20:51:26Z because the query has never been evaluated)
I0919 20:51:25.325975       1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk B (evaluation is throttled until 20:51:26Z because the query has never been evaluated)
...
I0919 20:51:26.324938       1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk B (evaluation is throttled until 20:51:26Z because the query has never been evaluated)
I0919 20:51:26.352599       1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk C (evaluation is throttled until 20:51:27Z because the query has never been evaluated)
I0919 20:51:26.353104       1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk C (evaluation is throttled until 20:51:27Z because the query has never been evaluated)
I0919 20:51:26.353594       1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk C (evaluation is throttled until 20:51:27Z because the query has never been evaluated)
...
I0919 20:51:27.352109       1 availableupdates.go:123] Requeue available-update evaluation, because "4.15.1" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk C (evaluation is throttled until 20:51:27Z because the query has never been evaluated)
I0919 20:51:27.747240       1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T20:51:27Z, will try later.
I0919 20:53:40.253893       1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T20:51:27Z, will try later.
I0919 20:55:55.361345       1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T20:51:27Z
I0919 20:58:10.467079       1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T20:55:55Z, will try later.
I0919 21:00:25.573629       1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T20:55:55Z
I0919 21:02:40.680171       1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T21:00:25Z, will try later.
I0919 21:04:55.785728       1 availableupdates.go:54] Retrieving available updates again, because more than 3m51.298904692s has elapsed since 2023-09-19T21:00:25Z
I0919 21:07:10.892253       1 availableupdates.go:71] Available updates were recently retrieved, with less than 3m51.298904692s elapsed since 2023-09-19T21:04:55Z, will try later.
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep 'evaluate PromQL cluster condition'
I0919 20:51:25.295589       1 promql.go:119] evaluate PromQL cluster condition: "max(cluster_proxy_enabled{type=~\"https?\"})"
I0919 20:51:26.334624       1 promql.go:119] evaluate PromQL cluster condition: "group(cluster_version_available_updates{channel=\"buggy\"})\nor\n0 * group(cluster_version_available_updates{channel!=\"buggy\"})"
I0919 20:51:27.356257       1 promql.go:119] evaluate PromQL cluster condition: "group(csv_succeeded{name=~\"local-storage-operator[.].*\"}) or 0 * group(csv_count)"

So that is warming all three risks from 20:51:25 to 20:51:27, at the desired one PromQL per second 🎉. It is a bit hot on the internal queue triggers; we can probably use AddAfter or something to cool that off. We'll see if my cluster still has an hour left to get the second round of PromQL refreshes...
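
As a hedged sketch of that AddAfter cool-off (the queue flavor and key are stand-ins, not the CVO's actual wiring):

package requeuesketch

import (
    "time"

    "k8s.io/client-go/util/workqueue"
)

// requeueThrottled retries a still-throttled evaluation about a second
// later instead of re-adding it immediately, so retries line up with
// the one-PromQL-per-second warming budget rather than spinning.
func requeueThrottled(queue workqueue.RateLimitingInterface, key string) {
    queue.AddAfter(key, time.Second)
}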

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 25, 2023
@openshift-ci bot (Contributor) commented Sep 25, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@petr-muller (Member) commented Sep 25, 2023

/test e2e-agnostic-ovn-upgrade-into-change

[sig-node] nodes should not go unready after being upgraded and go unready only once
{  1 nodes violated upgrade expectations: 
 Node ci-op-b4tfk34r-deec6-x75rx-worker-centralus1-lj9tf went unready multiple times: 2023-09-23T07:24:29Z, 2023-09-23T08:06:37Z
 Node ci-op-b4tfk34r-deec6-x75rx-worker-centralus1-lj9tf went ready multiple times: 2023-09-23T07:24:29Z, 2023-09-23T08:07:09Z

Sippy says this flake is common now in 4.14

@petr-muller (Member)

Argh I actually wanted to /override 😬

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 8210fc7 and 2 for PR HEAD f92ff4b in total

@wking (Member, Author) commented Sep 25, 2023

I expect this pick to fail on context conflicts, but asking for it anyway to have an unambiguous motivation for the manual backport:

/cherrypick release-4.14

@openshift-cherrypick-robot

@wking: once the present PR merges, I will cherry-pick it on top of release-4.14 in a new PR and assign it to you.


In response to this:

I expect this pick to fail on context conflicts, but asking for it anyway to have an unambiguous motivation for the manual backport:

/cherrypick release-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci bot (Contributor) commented Sep 25, 2023

@wking: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit ddbf138 into openshift:master Sep 25, 2023
@openshift-ci-robot (Contributor)

@wking: Jira Issue OCPBUGS-19512: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-19512 has been moved to the MODIFIED state.


In response to this:

530a509 (#909) prioritized the order in which risks were evaluated. But we were still waiting 10 minutes between different PromQL evaluations while evaluating conditional update risks. This pull request sets up faster initial cache-warming while maintaining the same long-term load on the backing match engine, so it doesn't take 10 minutes to get that initial "I've figured it out now!" result. I'm preserving 10-minute throttling between match calls when the cache is already warm. But when the cache is warming up, we allow a match call every second, for much faster cache warming. [edit: we dropped the warming-vs-resyncing distinction]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@wking: #939 failed to apply on top of branch "release-4.14":

Applying: pkg/cvo/availableupdates: Return a copy in getAvailableUpdates
Using index info to reconstruct a base tree...
M	pkg/cvo/availableupdates.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/cvo/availableupdates.go
CONFLICT (content): Merge conflict in pkg/cvo/availableupdates.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 pkg/cvo/availableupdates: Return a copy in getAvailableUpdates
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


In response to this:

I expect this pick to fail on context conflicts, but asking for it anyway to have an unambiguous motivation for the manual backport:

/cherrypick release-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the fast-fill-promql-cache branch September 25, 2023 22:50
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 25, 2023
Instead of waiting for the next round of evaluation, which might take
minutes.  For example, in 4.14.0-rc.1 testing [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep availableupdate
  I0915 18:21:24.184272       1 availableupdates.go:50] First attempt to retrieve available updates
  I0915 18:21:24.845512       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.836566       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.843398       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:54.835464       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.769850       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.784421       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.842269       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.862590       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:24:09.837669       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T18:21:24Z
  I0915 18:24:24.843569       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  I0915 18:25:24.839869       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  ...
  I0915 20:26:07.109093       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:22:23Z
  I0915 20:29:50.769739       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:26:07Z
  I0915 20:33:34.432215       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:29:50Z
  I0915 20:37:18.093261       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:33:34Z

I'm not entirely clear on what the triggers were there, with 3m44s
between those final entries.  Operator.Run sets up:

  wait.UntilWithContext(runContext, func(runContext context.Context) {
    optr.worker(runContext, optr.availableUpdatesQueue, optr.availableUpdatesSync)
  }, time.Second)

and [2] docs UntilWithContext:

  UntilWithContext loops until context is done, running f every period.

  UntilWithContext is syntactic sugar on top of JitterUntilWithContext
  with zero jitter factor and with sliding = true (which means the
  timer for period starts after the f completes).

So that should be waking up, draining the queue, sleeping a second,
waking back up, draining the queue again, and so on forever.  Perhaps
we are just backing off to the slowest DefaultControllerRateLimiter
period [3][4], but I expect processNextWorkItem's handleErr call is
invoking Forget on the queue, because I don't see any of its
error-branch logging:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep 'Error handling\|out of the queue'
  ...no hits...

That suggests nothing is slowing down our queue processing from
once-per-second (plus evaluation time).  But what's feeding the queue
items to process?  The only Add calls seem to be in
clusterVersionEventHandler, but checking audit logs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h clusterversion kube-apiserver/*audit*.log.gz | jq -r 'select(.verb != "get" and .verb != "list" and .verb != "watch") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .verb + " " + .objectRef.subresource + " " + .user.username' | sort
  ...
  2023-09-15T18:26:24.841812Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:26:24.858507Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:29:39.835307Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:37:39.836698Z 200 update status system:serviceaccount:openshift-cluster-version:default

which are all hours before these 20:26 and similar update retrievals.
I suspect this is due to resyncPeriod(o.ResyncInterval) being passed to
NewFilteredSharedInformerFactory when generating the ClusterVersion
informer, putting a lower bound on the UpdateFunc event-handler
frequency.
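
For reference, that event-handler wiring looks roughly like the
following hedged sketch (queue and workKey are stand-ins; cache is
k8s.io/client-go/tools/cache), where a resync re-fires UpdateFunc
about once per resyncPeriod even when nothing changed on the API
server:

  informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
          AddFunc:    func(obj interface{}) { queue.Add(workKey) },
          UpdateFunc: func(old, new interface{}) { queue.Add(workKey) },
          DeleteFunc: func(obj interface{}) { queue.Add(workKey) },
  })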

My goal is to set the stage for faster cache-warming after receiving a
batch of new PromQL update risks, as described in 530a509
(pkg/cvo/availableupdates: Prioritize conditional risks for largest
target version, 2023-03-06, openshift#909).  I still have not adjusted the
caching logic, so at the moment, it only gives us faster updates on
the "that PromQL is still throttled" loop.  The AddAfter avoids
hot-looping on:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 right now.

and instead gives us:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 around a second from now.
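
In code, that change is roughly (hedged; the key name is a stand-in):

  queue.Add(availableUpdatesKey)                   // before: hot loop
  queue.AddAfter(availableUpdatesKey, time.Second) // after: ~1s cool-off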

The new addUpdate avoids injecting the same Recommended=True target
into availableUpdates multiple times while trying to evaluate another
conditional update, now that we have the !needFreshFetch case, where
we recycle the previous structure's data without a fresh Cincinnati
fetch to clear earlier additions.  Without the addUpdate pivot, we get
[5]:

  status:
    availableUpdates:
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    ...
    conditionalUpdates:
    - conditions:
      - lastTransitionTime: "2023-09-21T09:29:30Z"
        message: The update is recommended, because none of the conditional update
          risks apply to this cluster.
        reason: AsExpected
        status: "True"
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
        version: 4.15.0-0.nightly-2023-10-19-222222
      risks:
      - matchingRules:
        - promql:
            promql: |-
              cluster_infrastructure_provider{type=~"nonexist"}
              or
              0 * cluster_infrastructure_provider
          type: PromQL
        message: Clusters on nonexist provider, this imaginary bug can happen.
        name: SomeInfrastructureThing
        url: https://bug.example.com/c
     ...
    - conditions:
      - lastTransitionTime: "2023-09-21T09:29:31Z"
        message: |-
          On clusters on default invoker user, this imaginary bug can happen. https://bug.example.com/a

          Could not evaluate exposure to update risk SomeChannelThing (evaluation is throttled until 09:29:32Z)
            SomeChannelThing description: On clusters with the channel set to 'buggy', this imaginary bug can happen.
            SomeChannelThing URL: https://bug.example.com/b
        reason: MultipleReasons
        status: "False"
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:66c753e8b75d172f2a3f7ba13363383a76ecbc7ecdc00f3a423bef4ea8560405
        version: 4.15.0-0.nightly-2023-10-17-000000
      risks:
      - matchingRules:
        - promql:
            promql: cluster_installer
          type: PromQL
        message: On clusters on default invoker user, this imaginary bug can happen.
        name: SomeInvokerThing
        url: https://bug.example.com/a
      - matchingRules:
        - promql:
            promql: |-
              group(cluster_version_available_updates{channel="buggy"})
              or
              0 * group(cluster_version_available_updates{channel!="buggy"})
          type: PromQL
        message: On clusters with the channel set to 'buggy', this imaginary bug can
          happen.
        name: SomeChannelThing
        url: https://bug.example.com/b

lasting until the next Cincinnati fetch cleared out the
availableUpdates redundancy.
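
A minimal sketch of that addUpdate dedupe, assuming the receiver and
field names (only the configv1.Release shape is the real OpenShift
API):

  func (u *availableUpdates) addUpdate(release configv1.Release) {
          for _, existing := range u.Updates {
                  if existing.Image == release.Image {
                          return // already injected; skip the duplicate
                  }
          }
          u.Updates = append(u.Updates, release)
  }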

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472
[2]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#UntilWithContext
[3]: https://pkg.go.dev/k8s.io/client-go/util/workqueue#DefaultControllerRateLimiter
[4]: https://github.com/kubernetes/client-go/blob/v0.28.2/util/workqueue/default_rate_limiters.go#L39
[5]: openshift#939 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 25, 2023
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Sep 27, 2023
@openshift-merge-robot (Contributor)

Fix included in accepted release 4.15.0-0.nightly-2023-09-27-073353

wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 25, 2023
Instead of waiting for the next round of evaluation, which might take
minutes.  For example, in 4.14.0-rc.1 testing [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep availableupdate
  I0915 18:21:24.184272       1 availableupdates.go:50] First attempt to retrieve available updates
  I0915 18:21:24.845512       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.836566       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.843398       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:54.835464       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.769850       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.784421       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.842269       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.862590       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:24:09.837669       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T18:21:24Z
  I0915 18:24:24.843569       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  I0915 18:25:24.839869       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  ...
  I0915 20:26:07.109093       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:22:23Z
  I0915 20:29:50.769739       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:26:07Z
  I0915 20:33:34.432215       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:29:50Z
  I0915 20:37:18.093261       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:33:34Z

I'm not entirely clear on what the triggers were there, with 3m44s
between those final entries.  Operator.Run sets up:

  wait.UntilWithContext(runContext, func(runContext context.Context) {
    optr.worker(runContext, optr.availableUpdatesQueue, optr.availableUpdatesSync)
  }, time.Second)

and [2] docs UntilWithContext:

  UntilWithContext loops until context is done, running f every period.

  UntilWithContext is syntactic sugar on top of JitterUntilWithContext
  with zero jitter factor and with sliding = true (which means the
  timer for period starts after the f completes).

So that should be waking up, draining the queue, sleeping a second,
waking back up, draining the queue again, and on forever.  Perhaps we
are just backing off to the slowest DefaultControllerRateLimiter
period [3], but I expect processNextWorkItem's calling handleErr is
calling Forget on the queue, because I don't see any of its
error-branch logging:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep 'Error handling\|out of the queue'
  ...no hits...

That suggests nothing is slowing down our queue processing from
once-per-second (plus evaluation time).  But what's feeding the queue
items to process?  There only Add calls seem to be in
clusterVersionEventHandler, but checking audit logs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h clusterversion kube-apiserver/*audit*.log.gz | jq -r 'select(.verb != "get" and .verb != "list" and .verb != "watch") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .verb + " " + .objectRef.subresource + " " + .user.username' | sort
  ...
  2023-09-15T18:26:24.841812Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:26:24.858507Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:29:39.835307Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:37:39.836698Z 200 update status system:serviceaccount:openshift-cluster-version:default

which are all hours before those 20:26 and similar update retrievals.
I suspect this is due to resyncPeriod(o.ResyncInterval) being passed to
NewFilteredSharedInformerFactory when generating the ClusterVersion
informer, putting a lower bound on the interval between UpdateFunc
event-handler calls when the ClusterVersion object is not changing.
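
For a rough illustration of that mechanism, here is a sketch using a
generic client-go ConfigMap informer as a stand-in for the generated
ClusterVersion one; the namespace, queue key, and function names are
illustrative:

  package sketch

  import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/util/workqueue"
  )

  // watchWithResync wires an informer whose UpdateFunc fires for every
  // cached object roughly once per resync period even when nothing has
  // changed, so resync-driven queue Adds arrive minutes apart rather
  // than every second.
  func watchWithResync(client kubernetes.Interface, queue workqueue.Interface, resync time.Duration) {
    factory := informers.NewFilteredSharedInformerFactory(client, resync, "openshift-cluster-version", nil)
    factory.Core().V1().ConfigMaps().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
      UpdateFunc: func(oldObj, newObj interface{}) {
        queue.Add("cluster")
      },
    })
  }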

My goal is to set the stage for faster cache-warming after receiving a
batch of new PromQL update risks, as described in 530a509
(pkg/cvo/availableupdates: Prioritize conditional risks for largest
target version, 2023-03-06, openshift#909).  I still have not adjusted
the caching logic, so at the moment this change only gives us faster
updates on the "that PromQL is still throttled" loop.  The AddAfter avoids
hot-looping on:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 right now.

and instead gives us:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 around a second from now, as sketched below.
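
A minimal sketch of that requeue, assuming the usual delaying-queue
wiring (the function and parameter names are illustrative, not the
CVO's exact signatures):

  package sketch

  import (
    "context"
    "time"

    "k8s.io/client-go/util/workqueue"
  )

  // sync requeues the key about a second out when risk evaluation is
  // still throttled, instead of re-Adding immediately and hot-looping.
  func sync(ctx context.Context, queue workqueue.DelayingInterface, key string, canEvaluate func() bool) {
    if !canEvaluate() {
      queue.AddAfter(key, time.Second)
      return
    }
    // ...evaluate conditional update risks and push results...
  }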

The new addUpdate avoids injecting the same Recommended=True target
into availableUpdates multiple times while trying to evaluate another
conditional update, now that we have the !needFreshFetch case, where
we recycle the previous structure's data without a fresh Cincinnati
fetch to clear out earlier additions.  Without the addUpdate pivot, we
get [5]:

  status:
    availableUpdates:
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    ...
    conditionalUpdates:
    - conditions:
      - lastTransitionTime: "2023-09-21T09:29:30Z"
        message: The update is recommended, because none of the conditional update
          risks apply to this cluster.
        reason: AsExpected
        status: "True"
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
        version: 4.15.0-0.nightly-2023-10-19-222222
      risks:
      - matchingRules:
        - promql:
            promql: |-
              cluster_infrastructure_provider{type=~"nonexist"}
              or
              0 * cluster_infrastructure_provider
          type: PromQL
        message: Clusters on nonexist provider, this imaginary bug can happen.
        name: SomeInfrastructureThing
        url: https://bug.example.com/c
     ...
    - conditions:
      - lastTransitionTime: "2023-09-21T09:29:31Z"
        message: |-
          On clusters on default invoker user, this imaginary bug can happen. https://bug.example.com/a

          Could not evaluate exposure to update risk SomeChannelThing (evaluation is throttled until 09:29:32Z)
            SomeChannelThing description: On clusters with the channel set to 'buggy', this imaginary bug can happen.
            SomeChannelThing URL: https://bug.example.com/b
        reason: MultipleReasons
        status: "False"
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:66c753e8b75d172f2a3f7ba13363383a76ecbc7ecdc00f3a423bef4ea8560405
        version: 4.15.0-0.nightly-2023-10-17-000000
      risks:
      - matchingRules:
        - promql:
            promql: cluster_installer
          type: PromQL
        message: On clusters on default invoker user, this imaginary bug can happen.
        name: SomeInvokerThing
        url: https://bug.example.com/a
      - matchingRules:
        - promql:
            promql: |-
              group(cluster_version_available_updates{channel="buggy"})
              or
              0 * group(cluster_version_available_updates{channel!="buggy"})
          type: PromQL
        message: On clusters with the channel set to 'buggy', this imaginary bug can
          happen.
        name: SomeChannelThing
        url: https://bug.example.com/b

lasting until the next Cincinnati fetch cleared out the
availableUpdates redundancy.
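
The dedup itself can be as small as the following sketch; the type and
field names are assumptions for illustration, not the CVO's actual
ones:

  package sketch

  // Release is a stand-in for the payload identity we dedupe on.
  type Release struct {
    Image   string
    Version string
  }

  // addUpdate appends release only if no entry with the same image is
  // already present, so re-evaluating one conditional update cannot
  // inject the same Recommended=True target into availableUpdates twice.
  func addUpdate(updates []Release, release Release) []Release {
    for _, u := range updates {
      if u.Image == release.Image {
        return updates
      }
    }
    return append(updates, release)
  }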

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472
[2]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#UntilWithContext
[3]: https://pkg.go.dev/k8s.io/client-go/util/workqueue#DefaultControllerRateLimiter
[4]: https://github.com/kubernetes/client-go/blob/v0.28.2/util/workqueue/default_rate_limiters.go#L39
[5]: openshift#939 (comment)
wking added a commit to wking/openshift-enhancements that referenced this pull request Oct 30, 2023
…nt PromQL

Catching up with openshift/cluster-version-operator@518b44612d
(pkg/clusterconditions/promql: Reduce MinBetweenMatches to 1s,
2023-09-18, openshift/cluster-version-operator#939).
wking added a commit to wking/cluster-version-operator that referenced this pull request Nov 2, 2023
wking added a commit to wking/cluster-version-operator that referenced this pull request Dec 20, 2023
965bfb2 (pkg/cvo/availableupdates: Requeue risk evaluation on
failure, 2023-09-18, openshift#939) pivoted from "every syncAvailableUpdates
round that does anything useful has a fresh Cincinnati pull" to "some
syncAvailableUpdates rounds have a fresh Cincinnati pull, but others
just re-eval some Recommended=Unknown conditional updates".  Then
syncAvailableUpdates calls setAvailableUpdates.

However, until this commit, setAvailableUpdates had been bumping
LastAttempt every time, even in the "just re-eval some conditional
updates" case.  That meant we never tripped the:

        } else if !optrAvailableUpdates.RecentlyChanged(optr.minimumUpdateCheckInterval) {
                klog.V(2).Infof("Retrieving available updates again, because more than %s has elapsed since %s", optr.minimumUpdateCheckInterval, optrAvailableUpdates.LastAttempt.Format(time.RFC3339))

condition to trigger a fresh Cincinnati pull, which could lead to
deadlocks like:

1. Cincinnati serves vulnerable PromQL, like [1].
2. Clusters pick up that broken PromQL, try to evaluate, and fail.
   Re-eval-and-fail loop continues.
3. Cincinnati PromQL fixed, like [2].
4. Cases:
   a. Before 965bfb2, and also after this commit, clusters pick up
      the fixed PromQL, try to evaluate, and start succeeding.  Hooray!
   b. Clusters with 965bfb2 but without this commit say "it's been
      a long time since we pulled fresh Cincinnati information, but it
      has not been long since my last attempt to eval this broken
      PromQL, so let me skip the Cincinnati pull and re-eval that old
      PromQL", which fails.  Re-eval-and-fail loop continues.

To break out of 4.b, clusters on impacted releases can roll their CVO
pod:

  $ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pod

which will clear out LastAttempt and trigger a fresh Cincinnati pull.
I'm not sure if there's another recovery method...
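
For illustration, a sketch of the guard and the fix, with assumed
field and method names:

  package sketch

  import "time"

  type availableUpdates struct {
    LastAttempt time.Time
  }

  // RecentlyChanged reports whether the last attempt happened within
  // interval; while it stays true, syncAvailableUpdates skips the fresh
  // Cincinnati pull.
  func (u *availableUpdates) RecentlyChanged(interval time.Duration) bool {
    return u.LastAttempt.After(time.Now().Add(-interval))
  }

  // recordAttempt captures the fix: bump LastAttempt only on rounds
  // that made a fresh Cincinnati fetch, so re-eval-only rounds no
  // longer push the next real pull further into the future.
  func (u *availableUpdates) recordAttempt(freshFetch bool) {
    if freshFetch {
      u.LastAttempt = time.Now()
    }
  }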

[1]: openshift/cincinnati-graph-data#4524
[2]: openshift/cincinnati-graph-data#4528
wking added a commit to wking/cluster-version-operator that referenced this pull request Dec 20, 2023
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jan 2, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jan 4, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jan 12, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jan 20, 2024