
Conversation

@petr-muller (Member)

task graph test: do not wait after cancelation

Given correct synchronization I do not see any value in the `Sleep`; it just makes the test much slower:

```console
$ go test --count 10 --run TestRunGraph/cancelation_without_task_errors  ./pkg/payload/
ok  	github.com/openshift/cluster-version-operator/pkg/payload	10.041s

$ git checkout ocpbugs-22442-test-run-graph-mid-task-cancellation-flake
...
$ go test --count 10 --run TestRunGraph/cancellation_without_task_errors  ./pkg/payload/
ok  	github.com/openshift/cluster-version-operator/pkg/payload	0.043s
```

task graph: simplify code

We only start draining `workCh` when canceled (`ctx.Done()`). We never submit more work once canceled, so resetting the `submitted` record is useless.
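
For readers skimming the discussion below, here is a minimal, self-contained sketch of the drain-on-cancel pattern the second commit simplifies. The names (`workCh`, `inflight`, `submitted`) follow this conversation; the code is illustrative only, not the actual `RunGraph` implementation in `pkg/payload`.

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	workCh := make(chan int, 3)
	submitted := map[int]bool{}
	inflight := 0
	for i := 0; i < 3; i++ { // work queued before cancelation
		workCh <- i
		submitted[i] = true
		inflight++
	}

	cancel() // workers are canceled; nothing else will consume workCh anymore

	for inflight > 0 {
		select {
		case <-ctx.Done():
			select {
			case <-workCh: // workers canceled, so remove any work from the queue ourselves
				// No need to reset the submitted record here: once canceled we
				// never submit more work, so it is never consulted again.
				inflight--
			}
		}
	}
	fmt.Println("queue drained after cancelation; inflight =", inflight)
}
```

Once the workers are canceled, the only remaining job is to pull queued work off the channel and count it down; nothing ever reads `submitted` again.
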
@openshift-ci-robot added the `jira/valid-reference` label (Indicates that this PR references a valid Jira ticket of any type.) on Nov 4, 2024
@openshift-ci-robot (Contributor)

@petr-muller: This pull request explicitly references no jira issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci bot requested review from hongkailiu and wking on November 4, 2024 15:32
@petr-muller changed the title from "NO-JIRA: task graph: Test speedup and code cleaup" to "NO-JIRA: task graph: test speedup and code cleaup" on Nov 4, 2024
@openshift-ci bot added the `approved` label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Nov 4, 2024
```diff
 case <-ctx.Done():
 	select {
-	case runTask := <-workCh: // workers canceled, so remove any work from the queue ourselves
+	case <-workCh: // workers canceled, so remove any work from the queue ourselves
```
@hongkailiu (Member), Nov 4, 2024

NIT:

With `submitted[runTask.index] = false` removed, perhaps we can update the comment as well?

Suggested change:

```diff
-	case <-workCh: // workers canceled, so remove any work from the queue ourselves
+	case <-workCh: // Reset the inflight counter as workers got canceled. We never submit more work once canceled and thus no need to reset the submitted records.
```

@petr-muller (Member, Author)

The important part is still the "remove any work from the queue" though - that needs to stay. The channel gets drained (eventually...).

I'm not convinced about the usefulness of mentioning the `submitted` records: it explains the absence of code, but that only makes sense attached to a commit/code change; it does not make that much sense in the actual code? A "we are not doing something because it does not need to be done" comment IMO makes sense only if that absence is actually surprising, which does not seem to be the case here?

@hongkailiu (Member)

the "remove any work from the queue" though - that needs to stay.

I think i got it now.
"remove any work from the queue ourselves" corresponds to <-workCh (that is why they stay in the same line).

My mistake was I thought the comment was for the code below, currently having only inflight-- which i couldn't relate to the comment. Actually, inflight-- is the additional things we need to do when <-workCh takes place.

We are not doing something because it does not need to be done

I got the point.

like comment IMO makes sense only if that absence is actually surprising,

It would probably surprise the author of "submitted[runTask.index] = false". 😉 When we push a task, we do submitted[nextNode] = true, my intuitive feeling is that we would set it to false when it is removed from the queue before the task is completed. With a comment, it would be telling that we do not do it intentionally, instead of forgetting it. Maybe it is just me. Just a NIT anyway.
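
To make the placement point concrete, the shape under discussion is roughly the following; the package, function, and variable names are illustrative, not a verbatim excerpt of the merged code:

```go
package sketch

import "context"

// drainAfterCancel is a sketch, not the merged code: the inline comment
// documents the <-workCh receive itself, while inflight-- below it is the
// extra bookkeeping performed once that receive happens.
func drainAfterCancel(ctx context.Context, workCh <-chan int, inflight int) int {
	for inflight > 0 {
		select {
		case <-ctx.Done():
			select {
			case <-workCh: // workers canceled, so remove any work from the queue ourselves
				inflight--
			}
		}
	}
	return inflight
}
```
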

@petr-muller (Member, Author)

> It would probably surprise the author of `submitted[runTask.index] = false`

😁

Member

I'm the author of that line, in 632e763 (#455), and Petr's "We never submit more work once canceled..." makes sense to me, and if I `git blame ...` the line with the comment, I'll find his commit message, so all good on that side :)

@hongkailiu (Member)

/lgtm

@openshift-ci bot added the `lgtm` label (Indicates that a PR is ready to be merged.) on Nov 4, 2024
@openshift-ci bot (Contributor) commented on Nov 4, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

```go
callbacks: map[string]callbackFn{
	"a": func(t *testing.T, name string, ctx context.Context, cancelFn func()) error {
		cancelFn()
		time.Sleep(time.Second)
```
Member

I'd added this with the test-case back in eaa3d19 (#255). I didn't talk about the `Sleep` specifically in the commit message, but my guess is that what I was aiming for was "make sure `b` isn't running because the task-graph runner knows `a` failed" to avoid "`b` would have run, except the runner got lucky and raced closed before it got around to completing". It's been in place for a long time now though, and I'm fine dropping the "be-extra-certain" sleep at this point in order to allow us to run tests more quickly.
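
With the `Sleep` gone, the callback simply cancels and returns, trusting the runner's own synchronization to keep dependent tasks from starting. A minimal sketch of that shape, assuming a `callbackFn` type matching the signature in the excerpt above (everything else here is illustrative, not the actual test):

```go
package payload_test

import (
	"context"
	"testing"
)

// callbackFn mirrors the signature visible in the excerpt above; it is
// redeclared here only so the sketch stands alone.
type callbackFn func(t *testing.T, name string, ctx context.Context, cancelFn func()) error

// exampleCallbacks sketches the callback with the Sleep dropped: cancel
// mid-task and return immediately, relying on the task-graph runner's
// synchronization to guarantee that dependent tasks never start.
func exampleCallbacks() map[string]callbackFn {
	return map[string]callbackFn{
		"a": func(t *testing.T, name string, ctx context.Context, cancelFn func()) error {
			cancelFn()
			return nil
		},
	}
}
```
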

@petr-muller (Member, Author)

: [bz-etcd][invariant] alert/etcdHighCommitDurations should not be at or above info
{  etcdHighCommitDurations was at or above info for at least 3m44s on platformidentification.JobType{Release:"4.18", FromRelease:"", Platform:"azure", Architecture:"amd64", Network:"ovn", Topology:"ha"} (maxAllowed=0s): pending for 58s, firing for 3m44s:

Nov 04 18:00:48.917 - 224s  W namespace/openshift-etcd node/10.0.0.4:9979 pod/etcd-ci-op-9ddf1ctd-31db5-7dhcn-master-0 alert/etcdHighCommitDurations alertstate/firing severity/warning ALERTS{alertname="etcdHighCommitDurations", alertstate="firing", endpoint="etcd-metrics", instance="10.0.0.4:9979", job="etcd", namespace="openshift-etcd", pod="etcd-ci-op-9ddf1ctd-31db5-7dhcn-master-0", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}}

/override ci/prow/e2e-agnostic-ovn

: [Jira:"Test Framework"] monitor test initial-and-final-operator-log-scraper collection expand_less 	1m25s
{  failed during collection
unable to scan operator logs: error reading log for pods/community-operators-hqwvr -n openshift-marketplace -c registry-server: Get "https://api.ci-op-9ddf1ctd-bada2.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/openshift-marketplace/pods/community-operators-hqwvr/log?container=registry-server&timestamps=true": http2: client connection lost}
: [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility collection expand_less 	1m27s
{  failed during collection
Get "https://api.ci-op-9ddf1ctd-bada2.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/e2e-pod-network-disruption-test-jxdwt/pods?labelSelector=network.openshift.io%2Fdisruption-actor%3Dpoller%2Cnetwork.openshift.io%2Fdisruption-target%3Dpod-to-pod": http2: client connection lost}

/override ci/prow/e2e-agnostic-ovn-upgrade-out-of-change

Neither of these seems to indicate a problem that could be caused by the CVO.

@openshift-ci bot (Contributor) commented on Nov 4, 2024

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-ovn, ci/prow/e2e-agnostic-ovn-upgrade-out-of-change


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci bot (Contributor) commented on Nov 4, 2024

@petr-muller: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@petr-muller (Member, Author)

/label no-qe

Still just test tweaks and refactor

@openshift-ci bot added the `no-qe` label (Allows PRs to merge without qe-approved label.) on Nov 5, 2024
@openshift-merge-bot merged commit 90da0da into openshift:master on Nov 5, 2024
@petr-muller deleted the ocpbugs-22442-test-run-graph-mid-task-cancellation-flake branch on November 5, 2024 14:18
@openshift-bot (Contributor)

[ART PR BUILD NOTIFIER]

Distgit: cluster-version-operator
This PR has been included in build cluster-version-operator-container-v4.18.0-202411051638.p0.g90da0da.assembly.stream.el9.
All builds following this will include this PR.
