Improvements on pipeline cancel #2543
Conversation
The following is the coverage report on the affected files.
Force-pushed from 67936f4 to 8ad2727.
The following is the coverage report on the affected files.
Referenced by a follow-up change: "Emit additional events: Pipeline Start. Emit all events through the events.go module. Align and simplify the reconcile structure to have clear points for error handling and emitting events. Depends on tektoncd#2543. Depends on tektoncd#2526."
Force-pushed from 8ad2727 to 14e6785.
The following is the coverage report on the affected files.
Force-pushed from 14e6785 to 69541df.
The following is the coverage report on the affected files.
/meow
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: vdemeester
Just a few questions from me!
```go
if _, err := clientSet.TektonV1alpha1().TaskRuns(pr.Namespace).Patch(rprt.TaskRunName, types.JSONPatchType, b, ""); err != nil {
	errs = append(errs, fmt.Errorf("Failed to patch TaskRun `%s` with cancellation: %s", rprt.TaskRunName, err).Error())
}
```

```go
// Loop over the TaskRuns in the PipelineRun status.
// If a TaskRun is not in the status yet we should not cancel it anyways.
```
nice!
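For readers following along, here is a minimal sketch of the "build the patch once" idea behind the `b` argument in the Patch call above, assuming a plain RFC 6902 patch shape; `jsonPatchOp` and `cancelPatch` are illustrative names, not necessarily the identifiers used in the actual change:

```go
package pipelinerun

import "encoding/json"

// jsonPatchOp models a single RFC 6902 JSON patch operation.
type jsonPatchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// cancelPatch is marshalled once at startup; the same bytes can then be
// applied to every TaskRun being cancelled, instead of re-marshalling
// the patch inside the loop.
var cancelPatch, _ = json.Marshal([]jsonPatchOp{{
	Op:    "add",
	Path:  "/spec/status",
	Value: "TaskRunCancelled",
}})
```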
Just double checking: can we guarantee that all started TaskRuns will be in the status? I'm guessing we can, but wanted to check. I think this would mean that in the same reconcile loop in which we create a TaskRun, we always update the status. (I'm also wondering whether, if there was an error partway through, we might end up in an inconsistent state? E.g. we create the taskrun, then a transient networking error prevents us from updating the status?)
Yeah, I thought about that. In my understanding the reconciler would not take the same key from the queue while it is still being processed. When we start to reconcile we get the latest version of the pr from the API, which should have all running taskruns. It might be that a previous reconcile failed to update the status of the pipeline, but the same may be true for the taskrun status too, so I feel we are not introducing a new issue.
The current implementation uses the pipelineState, which requires fetching the pipeline and taskruns and tasks and pipeline resources, and validating and binding them all together... I was hoping to move the handling of cancel earlier in the reconcile and thus avoid calculating the pipelineState.
It might be that a previous reconcile failed to update the status of the pipeline, but the same may be true for the taskrun status too, so I feel we are not introducing a new issue.
I guess in that case the next reconcile would notice that the status needed to be updated anyway, then update it, and then the "is cancelled" block would catch any taskruns that need to be cancelled that weren't before 🤔
it seems like eventually it would end up in the right state?
Thank you, you make a very good point.
I agree that the scenario you describe is not covered.
Looking deeper into the code, it looks like it is not covered today either 😅
ResolvePipelineRun is used to build the pipeline run state. That function behaves as follows:
- for each task in the spec:
  - look in the pipelinerun status for a matching taskrun
  - when not found, generate a new name
  - try to fetch the taskrun; if that fails with a NotFound error, ignore it

This means that if we fail to sync the pipelinerun status, we never go and list TaskRuns by owner, so the taskruns created during the reconcile cycle where the issue happened become orphaned. A new name is generated for the tasks, so presumably dag.GetSchedulable will return the task again and we will end up running the same task(s) twice.
It is a rather unlikely case, since it requires being able to create a taskrun and then, a few milliseconds later, being unable to update the pipelinerun status; however, it may well happen.
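To make that failure mode concrete, here is a rough sketch of the lookup order described above; `taskRunNameFromStatus` and `generateNewTaskRunName` are hypothetical helper names standing in for the real resolution code, and the k8s.io/apimachinery/pkg/api/errors package is assumed for the NotFound check:

```go
// Sketch only: mirrors the lookup order described in the list above.
for _, pt := range pipelineSpec.Tasks {
	// 1. Look in the pipelinerun status for a matching taskrun name.
	trName, found := taskRunNameFromStatus(pr.Status, pt.Name)
	if !found {
		// 2. Not in the status: generate a new name. If a previous status
		// update failed, the already-created taskrun is orphaned here.
		trName = generateNewTaskRunName(pr.Name, pt.Name)
	}
	// 3. Try to fetch the taskrun; a NotFound error is ignored.
	tr, err := taskRunLister.TaskRuns(pr.Namespace).Get(trName)
	if err != nil && !errors.IsNotFound(err) {
		return nil, err
	}
	_ = tr // nil means "not created yet", so the task gets scheduled again
}
```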
I think a way forward would be to ensure that the pipelinerun status is valid.
To make sure that the status is in sync with the list of taskruns owned by the pipelinerun, we can run this check before we try and use the pipelinerun status as the source of truth. To do so we only need the pipelinerun status and the object ref of the pipelinerun, so we do not need to go through the resolution and validation and conversion and re-conversion madness.
I will propose a patch that does that and keep this on hold; once (if) the other patch is approved, it should then be fine to merge this roughly as it is today. Does that sound like a plan?
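A minimal sketch of that sync check, assuming the owned taskruns can be listed via the `tekton.dev/pipelineRun` label and that each taskrun carries the `tekton.dev/pipelineTask` label; the function name is illustrative, the real fix lives in #2558:

```go
// updateStatusFromOwnedTaskRuns folds into the PipelineRun status any owned
// TaskRun that is missing from it (an orphan left by a failed status update).
// ownedTaskRuns would come from listing TaskRuns that carry the
// tekton.dev/pipelineRun label for this PipelineRun.
func updateStatusFromOwnedTaskRuns(pr *v1alpha1.PipelineRun, ownedTaskRuns []*v1alpha1.TaskRun) {
	if pr.Status.TaskRuns == nil {
		pr.Status.TaskRuns = map[string]*v1alpha1.PipelineRunTaskRunStatus{}
	}
	for _, tr := range ownedTaskRuns {
		if _, ok := pr.Status.TaskRuns[tr.Name]; !ok {
			// Orphaned: owned by the PipelineRun but missing from its status.
			pr.Status.TaskRuns[tr.Name] = &v1alpha1.PipelineRunTaskRunStatus{
				PipelineTaskName: tr.Labels["tekton.dev/pipelineTask"],
				Status:           &tr.Status,
			}
		}
	}
}
```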
@bobcatfish Since, given #2558, this patch does not change the situation compared to today, I wonder if we could go ahead with this and fix #2558 afterwards.
I set up a unit test to verify that we do not recover orphaned taskruns, assuming that they may happen in case the pipelinerun status sync fails.
This WIP fix shows how we could solve the issue: #2558
This means that if we fail to sync the pipelinerun status, we never go and list TaskRuns by owner, so the taskruns created during the reconcile cycle where the issue happened become orphaned. A new name is generated for the tasks, so presumably dag.GetSchedulable will return the task again and we will end up running the same task(s) twice.
yikes!! nice find :D
To make sure that the status is in sync with the list of taskruns owned by the pipelinerun, we can run this check before we try and use the pipelinerun status as the source of truth.
kk sounds good to me!
Thanks for looking into this and opening the other issue 🙏
```diff
@@ -48,6 +48,7 @@ var (
 		stats.UnitDimensionless)
 )
 
+// Recorder holds keys for Tekton metrics
```
thanks for adding docs!
```diff
@@ -160,7 +160,8 @@ func (c *Reconciler) Reconcile(ctx context.Context, key string) error {
 	// to update, and return the original error combined with any update error
 	var merr error
 
-	if pr.IsDone() {
+	switch {
```
nice :D
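As a sketch, the reconcile entry then takes the shape below; only the overall structure is taken from the diff, the case bodies are paraphrased:

```go
switch {
case pr.IsDone():
	// Nothing left to reconcile: clean up and emit final events.
case pr.IsCancelled():
	// Handled early: patch the TaskRuns recorded in the status as cancelled,
	// without building the DAG or resolving resources.
default:
	// Normal path: resolve the pipeline, build the pipelineState, schedule tasks.
}
```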
/hold
When a pipeline is cancelled, detect it early in the reconcile cycle and cancel all taskruns defined in the pipelinerun status, so that we don't spend time building the DAG and resolving all resources. The patch to cancel a TaskRun is the same for all TaskRuns, so build it once and apply it to all taskruns.
Force-pushed from 69541df to 6f1a6ea.
/hold cancel
The following is the coverage report on the affected files.
Thanks for the in-depth back and forth :D 💃 /lgtm
Changes
When a pipeline is cancelled, detect it early in the reconcile
cycle and cancel all taskruns defined in the pipelinerun status,
so that we don't spend time building the DAG and resolving all
resources.
The patch to cancel a TaskRun is the same for all TaskRuns, so
build it once and apply it to all taskruns.
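A condensed sketch of that flow, reusing the pre-built patch bytes from the sketch earlier in the thread; the function name and the error aggregation are illustrative, not necessarily the exact code merged here:

```go
// Assumed imports: fmt, strings, k8s.io/apimachinery/pkg/types, and the
// generated Tekton clientset (clientset.Interface).
func cancelPipelineRun(clientSet clientset.Interface, pr *v1alpha1.PipelineRun) error {
	var errs []string
	// Loop over the TaskRuns in the PipelineRun status; a TaskRun that is
	// not in the status yet does not need to be cancelled.
	for taskRunName := range pr.Status.TaskRuns {
		// cancelPatch was marshalled once and is reused for every TaskRun.
		if _, err := clientSet.TektonV1alpha1().TaskRuns(pr.Namespace).Patch(
			taskRunName, types.JSONPatchType, cancelPatch, ""); err != nil {
			errs = append(errs, fmt.Sprintf("failed to patch TaskRun %q with cancellation: %v", taskRunName, err))
		}
	}
	if len(errs) > 0 {
		return fmt.Errorf("error(s) cancelling TaskRun(s): %s", strings.Join(errs, "; "))
	}
	return nil
}
```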
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
See the contribution guide for more details.
Double check this list of stuff that's easy to miss:
- If you've added a new binary to the cmd dir, please update the release Task to build and release this image.
Reviewer Notes
If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.