OCPBUGS-19737: pkg/clusterconditions/promql: Warm cache with 1s delay #973
Conversation
…openshift#964)

* availableupdates: do not reset lastTransitionTime on unchanged status

  The code in `evaluateConditionalUpdates` correctly uses `SetStatusCondition` to set conditions, which only updates the `LastTransitionTime` field when `Status` differs between the original and updated state. Previously though, the original state always contained empty conditions, because conditional updates are always obtained from OSUS and the fresh structure was never updated with existing conditions from the in-cluster status.

* review: use existing mock condition instead of new code

* review: use real queue instead of a mock
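A minimal, self-contained sketch of the `SetStatusCondition` behavior that commit relies on (the condition values here are illustrative, not the operator's real data): `meta.SetStatusCondition` only resets `LastTransitionTime` when `Status` actually changes, which is why seeding the evaluation with the conditions already present in the in-cluster status keeps the timestamp stable.

```go
// Sketch only: demonstrates that SetStatusCondition preserves
// LastTransitionTime when the incoming Status matches the existing one.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	earlier := metav1.NewTime(time.Date(2023, 9, 15, 18, 21, 24, 0, time.UTC))

	// Conditions carried over from the in-cluster status. The bug was that
	// this slice started out empty on every evaluation, so the "new"
	// condition always looked like a transition.
	conditions := []metav1.Condition{{
		Type:               "Recommended",
		Status:             metav1.ConditionTrue,
		Reason:             "AsExpected",
		LastTransitionTime: earlier,
	}}

	// Same Status as before: Reason and Message are refreshed, but
	// LastTransitionTime keeps its previous value.
	meta.SetStatusCondition(&conditions, metav1.Condition{
		Type:   "Recommended",
		Status: metav1.ConditionTrue,
		Reason: "AsExpected",
	})
	fmt.Println(conditions[0].LastTransitionTime) // still 2023-09-15 18:21:24 +0000 UTC
}
```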
The function had returned the original pointer since it landed in db150e6 (cvo: Perform status updates in a single thread, 2018-11-03, openshift#45). But locking the operator structure to return a pointer reference is a bit risky, because after the lock is released you're still holding a pointer into that data, but lack easy access to the lock to guard against simultaneous access. For example, you could have setAvailableUpdates updating the structure, while simultaneously operatorMetrics.Collect, Operator.syncStatus, or Operator.mergeReleaseMetadata is looking at their pointer reference to the old data.

There wasn't actually much exposure, because writes all happened to flow through setAvailableUpdates, and setAvailableUpdates's only changes were:

* Bumping the u.LastSyncOrConfigChange Time.
* Replacing the availableUpdates pointer with a new pointer.

and neither of those should significantly disrupt any of the consumers. But switching to a copy doesn't cost much resource-wise, and it protects us from a number of possible ways that this could break in the future if setAvailableUpdates does less full-pointer-replacement or one of the consumers starts to care about LastSyncOrConfigChange reliably lining up with the rest of the availableUpdates content. It does mean we need to update the copy logic as we add new properties to the structure, but we'd need to do that even if we used deepcopy-gen or similar to automate the copy generation.
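A minimal sketch of the copy-under-lock pattern described above, using simplified stand-in types rather than the operator's real fields:

```go
// Sketch only: the getter copies the structure while holding the lock, so
// callers never retain a pointer into data a later setAvailableUpdates call
// may replace or mutate.
package main

import (
	"fmt"
	"sync"
	"time"
)

type availableUpdates struct {
	LastSyncOrConfigChange time.Time
	Updates                []string // stand-in for the real release slice
}

type operator struct {
	mu               sync.Mutex
	availableUpdates *availableUpdates
}

// getAvailableUpdates returns a copy made under the lock; after the lock is
// released the caller holds independent data, not a view into the operator.
func (o *operator) getAvailableUpdates() *availableUpdates {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.availableUpdates == nil {
		return nil
	}
	c := *o.availableUpdates // copies the scalar fields
	c.Updates = append([]string(nil), o.availableUpdates.Updates...)
	return &c
}

func main() {
	o := &operator{availableUpdates: &availableUpdates{
		LastSyncOrConfigChange: time.Now(),
		Updates:                []string{"4.15.0-0.nightly-2023-10-19-222222"},
	}}
	snapshot := o.getAvailableUpdates()

	// Simulate setAvailableUpdates replacing the pointer with fresh data.
	o.mu.Lock()
	o.availableUpdates = &availableUpdates{LastSyncOrConfigChange: time.Now()}
	o.mu.Unlock()

	fmt.Println(snapshot.Updates) // the earlier copy is unaffected
}
```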
/jira cherrypick OCPBUGS-19512
@wking: Jira Issue OCPBUGS-19512 has been cloned as Jira Issue OCPBUGS-19737. Will retitle bug to link to clone.
@wking: This pull request references Jira Issue OCPBUGS-19737, which is valid. The bug has been moved to the POST state. 6 validation(s) were run on this bug.

Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
Instead of waiting for the next round of evaluation, which might take minutes. For example, in 4.14.0-rc.1 testing [1]:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep availableupdate
    I0915 18:21:24.184272 1 availableupdates.go:50] First attempt to retrieve available updates
    I0915 18:21:24.845512 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
    I0915 18:21:39.836566 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
    I0915 18:21:39.843398 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
    I0915 18:21:54.835464 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
    I0915 18:23:16.769850 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
    I0915 18:23:16.784421 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
    I0915 18:23:39.842269 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
    I0915 18:23:39.862590 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
    I0915 18:24:09.837669 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T18:21:24Z
    I0915 18:24:24.843569 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
    I0915 18:25:24.839869 1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
    ...
    I0915 20:26:07.109093 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:22:23Z
    I0915 20:29:50.769739 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:26:07Z
    I0915 20:33:34.432215 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:29:50Z
    I0915 20:37:18.093261 1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:33:34Z

I'm not entirely clear on what the triggers were there, with 3m44s between those final entries. Operator.Run sets up:

    wait.UntilWithContext(runContext, func(runContext context.Context) {
        optr.worker(runContext, optr.availableUpdatesQueue, optr.availableUpdatesSync)
    }, time.Second)

and the UntilWithContext docs [2] say:

    UntilWithContext loops until context is done, running f every period.
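For reference, a small self-contained sketch of how `wait.UntilWithContext` drives such a loop; the worker here is a hypothetical stand-in, not the CVO's actual `optr.worker`:

```go
// Sketch only: UntilWithContext re-runs the function every period after it
// returns (sliding timer, zero jitter), so a worker that drains its queue
// wakes up roughly once per second plus processing time.
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	worker := func(ctx context.Context) {
		// Stand-in for optr.worker draining optr.availableUpdatesQueue.
		fmt.Println("drained queue at", time.Now().Format(time.RFC3339))
	}

	// Equivalent to JitterUntilWithContext(ctx, worker, time.Second, 0.0, true).
	wait.UntilWithContext(ctx, worker, time.Second)
}
```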
UntilWithContext is syntactic sugar on top of JitterUntilWithContext with zero jitter factor and with sliding = true (which means the timer for the period starts after f completes). So that should be waking up, draining the queue, sleeping a second, waking back up, draining the queue again, and on forever.

Perhaps we are just backing off to the slowest DefaultControllerRateLimiter period [3], but I expect processNextWorkItem's calling handleErr is calling Forget on the queue, because I don't see any of its error-branch logging:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep 'Error handling\|out of the queue'
    ...no hits...

That suggests nothing is slowing down our queue processing from once-per-second (plus evaluation time). But what's feeding the queue items to process? The only Add calls seem to be in clusterVersionEventHandler, but checking audit logs:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
    $ zgrep -h clusterversion kube-apiserver/*audit*.log.gz | jq -r 'select(.verb != "get" and .verb != "list" and .verb != "watch") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .verb + " " + .objectRef.subresource + " " + .user.username' | sort
    ...
    2023-09-15T18:26:24.841812Z 200 update status system:serviceaccount:openshift-cluster-version:default
    2023-09-15T18:26:24.858507Z 200 update status system:serviceaccount:openshift-cluster-version:default
    2023-09-15T18:29:39.835307Z 200 update status system:serviceaccount:openshift-cluster-version:default
    2023-09-15T18:37:39.836698Z 200 update status system:serviceaccount:openshift-cluster-version:default

which are all hours before these 20:26 and similar update retrievals. I suspect this is due to resyncPeriod(o.ResyncInterval) being passed to NewFilteredSharedInformerFactory when generating the ClusterVersion informer, putting a lower bound on the UpdateFunc event-handler frequency.

My goal is to set the stage for faster cache-warming after receiving a batch of new PromQL update risks, as described in 530a509 (pkg/cvo/availableupdates: Prioritize conditional risks for largest target version, 2023-03-06, openshift#909). I still have not adjusted the caching logic, so at the moment, it only gives us faster updates on the "that PromQL is still throttled" loop. The AddAfter avoids hot-looping on:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 right now.

and instead gives us (sketched below):

1. Can I evaluate the risks?
2. No? Requeue and return to 1 around a second from now.

The new addUpdate avoids injecting the same Recommended=True target into availableUpdates multiple times while trying to evaluate another conditional update, now that we have the !needFreshFetch case, where we recycle the previous structure data without the fresh Cincinnati fetch to clear earlier additions.
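A minimal sketch of the AddAfter requeue pattern referenced above, with an assumed key and a stand-in throttle check rather than the operator's actual code:

```go
// Sketch only: when the risks cannot be evaluated yet, requeue the key about
// a second later instead of immediately, so the worker does not hot-loop on
// "can I evaluate the risks?".
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer queue.ShutDown()

	canEvaluate := func() bool { return false } // stand-in for the throttle check

	key := "cluster-version" // hypothetical queue key
	queue.Add(key)

	item, _ := queue.Get()
	if !canEvaluate() {
		// Not ready yet: come back in roughly a second rather than right now.
		queue.AddAfter(item, time.Second)
	}
	queue.Done(item)
	fmt.Println("requeued", item, "with a 1s delay")
}
```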
Without the addUpdate pivot, we get [5]:

    status:
      availableUpdates:
      - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
        version: 4.15.0-0.nightly-2023-10-19-222222
      - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
        version: 4.15.0-0.nightly-2023-10-19-222222
      - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
        version: 4.15.0-0.nightly-2023-10-19-222222
      ...
      conditionalUpdates:
      - conditions:
        - lastTransitionTime: "2023-09-21T09:29:30Z"
          message: The update is recommended, because none of the conditional update risks apply to this cluster.
          reason: AsExpected
          status: "True"
          type: Recommended
        release:
          image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
          version: 4.15.0-0.nightly-2023-10-19-222222
        risks:
        - matchingRules:
          - promql:
              promql: |-
                cluster_infrastructure_provider{type=~"nonexist"}
                or 0 * cluster_infrastructure_provider
            type: PromQL
          message: Clusters on nonexist provider, this imaginary bug can happen.
          name: SomeInfrastructureThing
          url: https://bug.example.com/c
      ...
      - conditions:
        - lastTransitionTime: "2023-09-21T09:29:31Z"
          message: |-
            On clusters on default invoker user, this imaginary bug can happen. https://bug.example.com/a
            Could not evaluate exposure to update risk SomeChannelThing (evaluation is throttled until 09:29:32Z)
              SomeChannelThing description: On clusters with the channel set to 'buggy', this imaginary bug can happen.
              SomeChannelThing URL: https://bug.example.com/b
          reason: MultipleReasons
          status: "False"
          type: Recommended
        release:
          image: registry.ci.openshift.org/ocp/release@sha256:66c753e8b75d172f2a3f7ba13363383a76ecbc7ecdc00f3a423bef4ea8560405
          version: 4.15.0-0.nightly-2023-10-17-000000
        risks:
        - matchingRules:
          - promql:
              promql: cluster_installer
            type: PromQL
          message: On clusters on default invoker user, this imaginary bug can happen.
          name: SomeInvokerThing
          url: https://bug.example.com/a
        - matchingRules:
          - promql:
              promql: |-
                group(cluster_version_available_updates{channel="buggy"})
                or 0 * group(cluster_version_available_updates{channel!="buggy"})
            type: PromQL
          message: On clusters with the channel set to 'buggy', this imaginary bug can happen.
          name: SomeChannelThing
          url: https://bug.example.com/b

lasting until the next Cincinnati fetch cleared out the availableUpdates redundancy.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472
[2]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#UntilWithContext
[3]: https://pkg.go.dev/k8s.io/client-go/util/workqueue#DefaultControllerRateLimiter
[4]: https://github.com/kubernetes/client-go/blob/v0.28.2/util/workqueue/default_rate_limiters.go#L39
[5]: openshift#939 (comment)
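For illustration, a hypothetical sketch of the kind of guard the addUpdate helper provides (the real helper lives in pkg/cvo/availableupdates.go and differs in detail): skip targets whose image pullspec is already recorded, so recycling the previous structure in the !needFreshFetch case cannot produce the redundancy shown above.

```go
// Sketch only: append a recommended target once per image pullspec.
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// addUpdate appends release to updates unless an entry with the same image
// pullspec is already present.
func addUpdate(updates []configv1.Release, release configv1.Release) []configv1.Release {
	for _, existing := range updates {
		if existing.Image == release.Image {
			return updates // already recorded; avoid duplicate entries
		}
	}
	return append(updates, release)
}

func main() {
	release := configv1.Release{
		Version: "4.15.0-0.nightly-2023-10-19-222222",
		Image:   "registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28",
	}
	var updates []configv1.Release
	updates = addUpdate(updates, release)
	updates = addUpdate(updates, release) // second call is a no-op
	fmt.Println(len(updates))             // 1
}
```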
530a509 (pkg/cvo/availableupdates: Prioritize conditional risks for largest target version, 2023-03-06, openshift#909) prioritized the order in which risks were evaluated. But we were still waiting 10 minutes between different PromQL evaluations while evaluating conditional update risks. The original 10m requirement is from the enhancement [1], and was implemented in ca186ed (pkg/clusterconditions/cache: Add a cache wrapper for client-side throttling, 2021-11-10, openshift#663). But discussing with Lala, Scott, and Ben, we feel that addressing the demonstrated user-experience need of low-latency risk evaluation [2] is worth reducing the throttling to 1s per expression evaluation.

We still have MinForCondition set to an hour, so with this commit, a cluster-version operator evaluating three risks will move from a timeline like:

1. 0s, hear about risks that depend on PromQL A, B, and C. Evaluate A for the first time.
2. 10m, evaluate B for the first time (MinBetweenMatches after 1).
3. 20m, evaluate C for the first time (MinBetweenMatches after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatches after 3).
5. 1h10m, evaluate B again (MinForCondition after 2 and MinBetweenMatches after 4).
6. 1h20m, evaluate C again (MinForCondition after 3 and MinBetweenMatches after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatches after 6).
8. 2h10m, evaluate B again (MinForCondition after 5 and MinBetweenMatches after 7).
9. 2h20m, evaluate C again (MinForCondition after 6 and MinBetweenMatches after 8).

to a timeline like:

1. 0s, hear about risks that depend on PromQL A, B, and C. Evaluate A for the first time.
2. 1s, evaluate B for the first time (MinBetweenMatches after 1).
3. 2s, evaluate C for the first time (MinBetweenMatches after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatches after 3).
5. 1h1s, evaluate B again (MinForCondition after 2 and MinBetweenMatches after 4).
6. 1h2s, evaluate C again (MinForCondition after 3 and MinBetweenMatches after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatches after 6).
8. 2h1s, evaluate B again (MinForCondition after 5 and MinBetweenMatches after 7).
9. 2h2s, evaluate C again (MinForCondition after 6 and MinBetweenMatches after 8).

We could deliver faster cache warming while preserving spaced-out refresh evaluation by splitting MinBetweenMatches into a 1s MinBetweenMatchesInitial and 10m MinBetweenMatchesWhenCached, which would produce timelines like:

1. 0s, hear about risks that depend on PromQL A, B, and C. Evaluate A for the first time.
2. 1s, evaluate B for the first time (MinBetweenMatchesInitial after 1).
3. 2s, evaluate C for the first time (MinBetweenMatchesInitial after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatchesWhenCached after 3).
5. 1h10m, evaluate B again (MinForCondition after 2 and MinBetweenMatchesWhenCached after 4).
6. 1h20m, evaluate C again (MinForCondition after 3 and MinBetweenMatchesWhenCached after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatchesWhenCached after 6).
8. 2h10m, evaluate B again (MinForCondition after 5 and MinBetweenMatchesWhenCached after 7).
9. 2h20m, evaluate C again (MinForCondition after 6 and MinBetweenMatchesWhenCached after 8).
But again, discussing with Lala, Scott, and Ben, the code complexity to deliver that distinction does not seem to be worth the protection it delivers to the PromQL engine. And really, PromQL engines concerned about load should harden themselves, including via Retry-After headers [3] that allow clients to back off gracefully when the service needs that, instead of relying on clients to guess about the load the service could handle and back off without insight into actual server capacity.

[1]: https://github.com/openshift/enhancements/blame/158111ce156aac7fa6063a47c00e129c13033aec/enhancements/update/targeted-update-edge-blocking.md#L323-L325
[2]: https://issues.redhat.com/browse/OCPBUGS-19512
[3]: https://www.rfc-editor.org/rfc/rfc9110#name-retry-after
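A simplified sketch of the client-side throttling this commit tunes: MinForCondition and MinBetweenMatches come from the commit message above, while the surrounding field and function names are assumptions, not the real pkg/clusterconditions/cache code.

```go
// Sketch only: an expression is re-evaluated when its own cached result is
// older than MinForCondition, and evaluations across all expressions are
// spaced at least MinBetweenMatches apart (now 1s instead of 10m).
package main

import (
	"fmt"
	"time"
)

type cacheEntry struct {
	evaluated time.Time // when this expression was last evaluated
	match     bool      // the cached result
}

type throttlingCache struct {
	MinForCondition   time.Duration // per-expression result lifetime (1h)
	MinBetweenMatches time.Duration // spacing between evaluations (now 1s)
	lastMatch         time.Time
	entries           map[string]cacheEntry
}

// canEvaluate reports whether the cache would allow evaluating expr now.
func (c *throttlingCache) canEvaluate(expr string, now time.Time) bool {
	if entry, ok := c.entries[expr]; ok && now.Sub(entry.evaluated) < c.MinForCondition {
		return false // cached result is still fresh enough
	}
	if now.Sub(c.lastMatch) < c.MinBetweenMatches {
		return false // too soon after the previous PromQL evaluation
	}
	return true
}

func main() {
	c := &throttlingCache{
		MinForCondition:   time.Hour,
		MinBetweenMatches: time.Second, // was 10 * time.Minute before this change
		entries:           map[string]cacheEntry{},
	}
	now := time.Now()
	c.lastMatch = now
	fmt.Println(c.canEvaluate("cluster_installer", now))                    // false: throttled
	fmt.Println(c.canEvaluate("cluster_installer", now.Add(2*time.Second))) // true
}
```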
13fbe56 to 604568f
/label backport-risk-assessed
LalatenduMohanty left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/label cherry-pick-approved
@wking: all tests passed! Full PR test history. Your PR dashboard.
@wking: Jira Issue OCPBUGS-19737: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-19737 has been moved to the MODIFIED state.
/cherrypick release-4.13
@wking: new pull request created: #974
Fix included in accepted release 4.14.0-0.nightly-2023-09-29-231104
Picking #939 back to 4.14, with some minor manual edits because I'm not bringing back #926 (a feature adding support for new environments). I am bringing back #964, because that was fixing existing behavior (so not a feature) and having it made the rebase slightly easier.