@smarterclayton
Contributor

A context cancel is always an error for the task graph

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 22, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 22, 2019
The RunGraph implementation was unchanged since it landed in
cb4e037 (payload: Create a task graph that can split a payload into
chunks, 2019-01-17, openshift#88), with the exception of later logging and
c2ac20f (status: Report the operators that have not yet deployed,
2019-04-09, openshift#158) with the adjusted return type.

The old code launched a goroutine for the pushing/reaping, which was
unnecessary and made error reporting on any outstanding tasks more
complicated.  I'd like to drop the goroutine, but Clayton is not
comfortable with backporting that large a change to older releases
[1].  And I'd like to be able to return errors like:

  1 incomplete task nodes, beginning with b

but Clayton thinks these are just "took too long, but we're still
making progress" and that they'll resolve on their own in the next
attempt or few, and that they're not actual deadlocks where you'd want
a better fingerprint to pin down the node(s) that were locking [2].

This commit ensures that when we are canceled we return an error, and
it does none of the refactoring we'd need to be able to say whether we
had unprocessed nodes (for late cancels, it's possible that we could
return "I was canceled" even if we had successfully pushed and reaped
all the nodes).  This should avoid situations like [3]:

  2019-10-21T10:34:30.63940461Z I1021 10:34:30.639073       1 start.go:19] ClusterVersionOperator v1.0.0-106-g0725bd53-dirty
  ...
  2019-10-21T10:34:31.132673574Z I1021 10:34:31.132635       1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Updating at attempt 0
  ...
  2019-10-21T10:40:16.168632703Z I1021 10:40:16.168604       1 sync_worker.go:579] Running sync for customresourcedefinition "baremetalhosts.metal3.io" (101 of 432)
  2019-10-21T10:40:16.18425522Z I1021 10:40:16.184220       1 task_graph.go:583] Canceled worker 0
  2019-10-21T10:40:16.184381244Z I1021 10:40:16.184360       1 task_graph.go:583] Canceled worker 3
  ...
  2019-10-21T10:40:16.21772875Z I1021 10:40:16.217715       1 task_graph.go:603] Workers finished
  2019-10-21T10:40:16.217777479Z I1021 10:40:16.217759       1 task_graph.go:611] Result of work: []
  2019-10-21T10:40:16.217864206Z I1021 10:40:16.217846       1 task_graph.go:539] Stopped graph walker due to cancel
  ...
  2019-10-21T10:43:08.743798997Z I1021 10:43:08.743740       1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Reconciling at attempt 0
  ...

where the CVO canceled some workers, saw that there were no worker
errors, and decided "upgrade complete" despite never having attempted
to push the bulk of its manifests.

Without the task_graph.go changes in this commit, the new test fails
with:

  $ go test -run TestRunGraph  ./pkg/payload/
  --- FAIL: TestRunGraph (1.03s)
      --- FAIL: TestRunGraph/cancelation_without_task_errors_is_reported (1.00s)
          task_graph_test.go:910: unexpected error: []
  FAIL
  FAIL		github.com/openshift/cluster-version-operator/pkg/payload				1.042s

Also change "cancelled" -> "canceled" to match Go's docs [4] and name
the other test cases.

[1]: openshift#255 (comment)
[2]: openshift#260
[3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/754/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-origin-4-1-sha256-f8c863ea08d64eea7b3a9ffbbde9c01ca90501afe6c0707e9c35f0ed7e92a9df/namespaces/openshift-cluster-version/pods/cluster-version-operator-5f5d465967-t57b2/cluster-version-operator/cluster-version-operator/logs/current.log
[4]: https://golang.org/pkg/context/#pkg-overview
@abhinavdahiya
Contributor

overridden by #255

also 255 has unit test.

/close

@openshift-ci-robot
Contributor

@abhinavdahiya: Closed this PR.

In response to this:

overridden by #255

also 255 has unit test.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Oct 22, 2019