
Conversation

@petr-muller (Member)

task graph test: do not wait after cancelation

Given correct synchronization I do not see any value in the `Sleep`; it just makes the test much slower:

```console
$ go test --count 10 --run TestRunGraph/cancelation_without_task_errors  ./pkg/payload/
ok  	github.com/openshift/cluster-version-operator/pkg/payload	10.041s

$ git checkout ocpbugs-22442-test-run-graph-mid-task-cancellation-flake
...
$ go test --count 10 --run TestRunGraph/cancellation_without_task_errors  ./pkg/payload/
ok  	github.com/openshift/cluster-version-operator/pkg/payload	0.043s
```

task graph: simplify code

We only start draining `workCh` when canceled (`ctx.Done()`). We never submit more work once canceled, so resetting the `submitted` record is useless.
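
For readers skimming the discussion below, here is a minimal, self-contained sketch of the drain-on-cancel pattern the second commit simplifies. The names (`workCh`, `inflight`, `submitted`) follow this conversation; the code is illustrative only, not the actual `RunGraph` implementation in `pkg/payload`.

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	workCh := make(chan int, 3)
	submitted := map[int]bool{}
	inflight := 0
	for i := 0; i < 3; i++ { // work queued before cancelation
		workCh <- i
		submitted[i] = true
		inflight++
	}

	cancel() // workers are canceled; nothing else will consume workCh anymore

	for inflight > 0 {
		select {
		case <-ctx.Done():
			select {
			case <-workCh: // workers canceled, so remove any work from the queue ourselves
				// No need to reset the submitted record here: once canceled we
				// never submit more work, so it is never consulted again.
				inflight--
			}
		}
	}
	fmt.Println("queue drained after cancelation; inflight =", inflight)
}
```

Once the workers are canceled, the only remaining job is to pull queued work off the channel and count it down; nothing ever reads `submitted` again.
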
@openshift-ci-robot added the `jira/valid-reference` label (Indicates that this PR references a valid Jira ticket of any type.) on Nov 4, 2024
@openshift-ci-robot (Contributor)

@petr-muller: This pull request explicitly references no jira issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci bot requested review from hongkailiu and wking on November 4, 2024 15:32
@petr-muller changed the title from "NO-JIRA: task graph: Test speedup and code cleaup" to "NO-JIRA: task graph: test speedup and code cleaup" on Nov 4, 2024
@openshift-ci bot added the `approved` label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Nov 4, 2024
```diff
 case <-ctx.Done():
 	select {
-	case runTask := <-workCh: // workers canceled, so remove any work from the queue ourselves
+	case <-workCh: // workers canceled, so remove any work from the queue ourselves
```
@hongkailiu (Member), Nov 4, 2024

NIT:

With `submitted[runTask.index] = false` removed, perhaps we can update the comment as well?

Suggested change:

```diff
-	case <-workCh: // workers canceled, so remove any work from the queue ourselves
+	case <-workCh: // Reset the inflight counter as workers got canceled. We never submit more work once canceled and thus no need to reset the submitted records.
```

@petr-muller (Member, Author)

The important part is still the "remove any work from the queue" though - that needs to stay. The channel gets drained (eventually...).

I'm not convinced about the usefulness of mentioning the `submitted` records: it explains the absence of code, but that only makes sense attached to a commit/code change; it does not make that much sense in the actual code? A "we are not doing something because it does not need to be done" comment IMO makes sense only if that absence is actually surprising, which does not seem to be the case here?

@hongkailiu (Member)

the "remove any work from the queue" though - that needs to stay.

I think i got it now.
"remove any work from the queue ourselves" corresponds to <-workCh (that is why they stay in the same line).

My mistake was I thought the comment was for the code below, currently having only inflight-- which i couldn't relate to the comment. Actually, inflight-- is the additional things we need to do when <-workCh takes place.

We are not doing something because it does not need to be done

I got the point.

like comment IMO makes sense only if that absence is actually surprising,

It would probably surprise the author of "submitted[runTask.index] = false". 😉 When we push a task, we do submitted[nextNode] = true, my intuitive feeling is that we would set it to false when it is removed from the queue before the task is completed. With a comment, it would be telling that we do not do it intentionally, instead of forgetting it. Maybe it is just me. Just a NIT anyway.
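
To make the placement point concrete, the shape under discussion is roughly the following; the package, function, and variable names are illustrative, not a verbatim excerpt of the merged code:

```go
package sketch

import "context"

// drainAfterCancel is a sketch, not the merged code: the inline comment
// documents the <-workCh receive itself, while inflight-- below it is the
// extra bookkeeping performed once that receive happens.
func drainAfterCancel(ctx context.Context, workCh <-chan int, inflight int) int {
	for inflight > 0 {
		select {
		case <-ctx.Done():
			select {
			case <-workCh: // workers canceled, so remove any work from the queue ourselves
				inflight--
			}
		}
	}
	return inflight
}
```
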

@petr-muller (Member, Author)

> It would probably surprise the author of `submitted[runTask.index] = false`

😁

Member

I'm the author of that line, in 632e763 (#455), and Petr's "We never submit more work once canceled..." makes sense to me, and if I `git blame ...` the line with the comment, I'll find his commit message, so all good on that side :)

@hongkailiu (Member)

/lgtm

@openshift-ci bot added the `lgtm` label (Indicates that a PR is ready to be merged.) on Nov 4, 2024
@openshift-ci bot (Contributor) commented on Nov 4, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

```go
callbacks: map[string]callbackFn{
	"a": func(t *testing.T, name string, ctx context.Context, cancelFn func()) error {
		cancelFn()
		time.Sleep(time.Second)
```
Member

I'd added this with the test-case back in eaa3d19 (#255). I didn't talk about the `Sleep` specifically in the commit message, but my guess is that what I was aiming for was "make sure `b` isn't running because the task-graph runner knows `a` failed" to avoid "`b` would have run, except the runner got lucky and raced closed before it got around to completing". It's been in place for a long time now though, and I'm fine dropping the "be-extra-certain" sleep at this point in order to allow us to run tests more quickly.
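
With the `Sleep` gone, the callback simply cancels and returns, trusting the runner's own synchronization to keep dependent tasks from starting. A minimal sketch of that shape, assuming a `callbackFn` type matching the signature in the excerpt above (everything else here is illustrative, not the actual test):

```go
package payload_test

import (
	"context"
	"testing"
)

// callbackFn mirrors the signature visible in the excerpt above; it is
// redeclared here only so the sketch stands alone.
type callbackFn func(t *testing.T, name string, ctx context.Context, cancelFn func()) error

// exampleCallbacks sketches the callback with the Sleep dropped: cancel
// mid-task and return immediately, relying on the task-graph runner's
// synchronization to guarantee that dependent tasks never start.
func exampleCallbacks() map[string]callbackFn {
	return map[string]callbackFn{
		"a": func(t *testing.T, name string, ctx context.Context, cancelFn func()) error {
			cancelFn()
			return nil
		},
	}
}
```
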

@petr-muller (Member, Author)

: [bz-etcd][invariant] alert/etcdHighCommitDurations should not be at or above info
{  etcdHighCommitDurations was at or above info for at least 3m44s on platformidentification.JobType{Release:"4.18", FromRelease:"", Platform:"azure", Architecture:"amd64", Network:"ovn", Topology:"ha"} (maxAllowed=0s): pending for 58s, firing for 3m44s:

Nov 04 18:00:48.917 - 224s  W namespace/openshift-etcd node/10.0.0.4:9979 pod/etcd-ci-op-9ddf1ctd-31db5-7dhcn-master-0 alert/etcdHighCommitDurations alertstate/firing severity/warning ALERTS{alertname="etcdHighCommitDurations", alertstate="firing", endpoint="etcd-metrics", instance="10.0.0.4:9979", job="etcd", namespace="openshift-etcd", pod="etcd-ci-op-9ddf1ctd-31db5-7dhcn-master-0", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}}

/override ci/prow/e2e-agnostic-ovn

: [Jira:"Test Framework"] monitor test initial-and-final-operator-log-scraper collection expand_less 	1m25s
{  failed during collection
unable to scan operator logs: error reading log for pods/community-operators-hqwvr -n openshift-marketplace -c registry-server: Get "https://api.ci-op-9ddf1ctd-bada2.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/openshift-marketplace/pods/community-operators-hqwvr/log?container=registry-server&timestamps=true": http2: client connection lost}
: [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility collection expand_less 	1m27s
{  failed during collection
Get "https://api.ci-op-9ddf1ctd-bada2.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/e2e-pod-network-disruption-test-jxdwt/pods?labelSelector=network.openshift.io%2Fdisruption-actor%3Dpoller%2Cnetwork.openshift.io%2Fdisruption-target%3Dpod-to-pod": http2: client connection lost}

/override ci/prow/e2e-agnostic-ovn-upgrade-out-of-change

Neither of these seems to indicate a problem that could be caused by the CVO.

@openshift-ci bot (Contributor) commented on Nov 4, 2024

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-ovn, ci/prow/e2e-agnostic-ovn-upgrade-out-of-change


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci bot (Contributor) commented on Nov 4, 2024

@petr-muller: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@petr-muller (Member, Author)

/label no-qe

Still just test tweaks and refactor

@openshift-ci bot added the `no-qe` label (Allows PRs to merge without qe-approved label.) on Nov 5, 2024
@openshift-merge-bot merged commit 90da0da into openshift:master on Nov 5, 2024
@petr-muller deleted the ocpbugs-22442-test-run-graph-mid-task-cancellation-flake branch on November 5, 2024 14:18
@openshift-bot (Contributor)

[ART PR BUILD NOTIFIER]

Distgit: cluster-version-operator
This PR has been included in build cluster-version-operator-container-v4.18.0-202411051638.p0.g90da0da.assembly.stream.el9.
All builds following this will include this PR.
