Improvements on pipeline cancel #2543
Conversation
The following is the coverage report on the affected files.
Force-pushed from 67936f4 to 8ad2727.
The following is the coverage report on the affected files.
Referenced by a follow-up change: "Emit additional events: Pipeline Start. Emit all events through the events.go module. Align and simplify the reconcile structure to have clear points for error handling and emitting events. Depends on tektoncd#2543. Depends on tektoncd#2526."
Force-pushed from 8ad2727 to 14e6785.
The following is the coverage report on the affected files.
Force-pushed from 14e6785 to 69541df.
The following is the coverage report on the affected files.
/meow
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: vdemeester
Just a few questions from me!
```go
if _, err := clientSet.TektonV1alpha1().TaskRuns(pr.Namespace).Patch(rprt.TaskRunName, types.JSONPatchType, b, ""); err != nil {
	errs = append(errs, fmt.Errorf("Failed to patch TaskRun `%s` with cancellation: %s", rprt.TaskRunName, err).Error())
}
```

```go
// Loop over the TaskRuns in the PipelineRun status.
// If a TaskRun is not in the status yet we should not cancel it anyways.
```
nice!
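For readers following along, here is a minimal sketch of the "build the patch once" idea behind the `b` argument in the Patch call above, assuming a plain RFC 6902 patch shape; `jsonPatchOp` and `cancelPatch` are illustrative names, not necessarily the identifiers used in the actual change:

```go
package pipelinerun

import "encoding/json"

// jsonPatchOp models a single RFC 6902 JSON patch operation.
type jsonPatchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// cancelPatch is marshalled once at startup; the same bytes can then be
// applied to every TaskRun being cancelled, instead of re-marshalling
// the patch inside the loop.
var cancelPatch, _ = json.Marshal([]jsonPatchOp{{
	Op:    "add",
	Path:  "/spec/status",
	Value: "TaskRunCancelled",
}})
```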
Just double checking: can we guarantee that all started TaskRuns will be in the status? I'm guessing we can, but wanted to check. I think this would mean that in the same reconcile loop in which we create a TaskRun, we always update the status. (I'm also wondering whether, if there was an error partway through, we might end up in an inconsistent state? E.g. we create the taskrun, then a transient networking error prevents us from updating the status?)
Yeah, I thought about that. In my understanding the reconciler would not take the same key from the queue while it is still being processed. When we start to reconcile we get the latest version of the pr from the API, which should have all running taskruns. It might be that a previous reconcile failed to update the status of the pipeline, but the same may be true for the taskrun status too, so I feel we are not introducing a new issue.
The current implementation uses the pipelineState, which requires fetching the pipeline and taskruns and tasks and pipeline resources, and validating and binding them all together... I was hoping to move the handling of cancel earlier in the reconcile and thus avoid calculating the pipelineState.
It might be that a previous reconcile failed to update the status of the pipeline, but the same may be true for the taskrun status too, so I feel we are not introducing a new issue.
I guess in that case the next reconcile would notice that the status needed to be updated anyway, then update it, and then the "is cancelled" block would catch any taskruns that need to be cancelled that weren't before 🤔
it seems like eventually it would end up in the right state?
Thank you, you make a very good point.
I agree that the scenario you describe is not covered.
Looking deeper into the code, it looks like it is not covered today either 😅
ResolvePipelineRun is used to build the pipeline run state. That function behaves as follows:
- for each task in the spec:
  - look in the pipelinerun status for a matching taskrun
  - when not found, generate a new name
  - try to fetch the taskrun; if that fails with a NotFound error, ignore it

This means that if we fail to sync the pipelinerun status, we never go and list TaskRuns by owner, so the taskruns created during the reconcile cycle where the issue happened become orphaned. A new name is generated for the tasks, so presumably dag.GetSchedulable will return the task again and we will end up running the same task(s) twice.
It is a rather unlikely case, since it requires being able to create a taskrun and then, a few milliseconds later, being unable to update the pipelinerun status; however, it may well happen.
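To make that failure mode concrete, here is a rough sketch of the lookup order described above; `taskRunNameFromStatus` and `generateNewTaskRunName` are hypothetical helper names standing in for the real resolution code, and the k8s.io/apimachinery/pkg/api/errors package is assumed for the NotFound check:

```go
// Sketch only: mirrors the lookup order described in the list above.
for _, pt := range pipelineSpec.Tasks {
	// 1. Look in the pipelinerun status for a matching taskrun name.
	trName, found := taskRunNameFromStatus(pr.Status, pt.Name)
	if !found {
		// 2. Not in the status: generate a new name. If a previous status
		// update failed, the already-created taskrun is orphaned here.
		trName = generateNewTaskRunName(pr.Name, pt.Name)
	}
	// 3. Try to fetch the taskrun; a NotFound error is ignored.
	tr, err := taskRunLister.TaskRuns(pr.Namespace).Get(trName)
	if err != nil && !errors.IsNotFound(err) {
		return nil, err
	}
	_ = tr // nil means "not created yet", so the task gets scheduled again
}
```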
I think a way forward would be to ensure that the pipelinerun status is valid.
To make sure that the status is in sync with the list of taskruns owned by the pipelinerun, we can run this check before we try and use the pipelinerun status as the source of truth. To do so we only need the pipelinerun status and the object ref of the pipelinerun, so we do not need to go through the resolution and validation and conversion and re-conversion madness.
I will propose a patch that does that and keep this on hold; once (if) the other patch is approved, it should then be fine to merge this roughly as it is today. Does that sound like a plan?
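A minimal sketch of that sync check, assuming the owned taskruns can be listed via the `tekton.dev/pipelineRun` label and that each taskrun carries the `tekton.dev/pipelineTask` label; the function name is illustrative, the real fix lives in #2558:

```go
// updateStatusFromOwnedTaskRuns folds into the PipelineRun status any owned
// TaskRun that is missing from it (an orphan left by a failed status update).
// ownedTaskRuns would come from listing TaskRuns that carry the
// tekton.dev/pipelineRun label for this PipelineRun.
func updateStatusFromOwnedTaskRuns(pr *v1alpha1.PipelineRun, ownedTaskRuns []*v1alpha1.TaskRun) {
	if pr.Status.TaskRuns == nil {
		pr.Status.TaskRuns = map[string]*v1alpha1.PipelineRunTaskRunStatus{}
	}
	for _, tr := range ownedTaskRuns {
		if _, ok := pr.Status.TaskRuns[tr.Name]; !ok {
			// Orphaned: owned by the PipelineRun but missing from its status.
			pr.Status.TaskRuns[tr.Name] = &v1alpha1.PipelineRunTaskRunStatus{
				PipelineTaskName: tr.Labels["tekton.dev/pipelineTask"],
				Status:           &tr.Status,
			}
		}
	}
}
```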
@bobcatfish Since, given #2558, this patch does not change the situation compared to today, I wonder if we could go ahead with this and fix #2558 afterwards.
I set up a unit test to verify that we do not recover orphaned taskruns, assuming that they may happen in case the pipelinerun status sync fails.
This WIP fix shows how we could solve the issue: #2558
This means that if we fail to sync the pipelinerun status, we never go and list TaskRuns by owner, so the taskruns created during the reconcile cycle where the issue happened become orphaned. A new name is generated for the tasks, so presumably dag.GetSchedulable will return the task again and we will end up running the same task(s) twice.
yikes!! nice find :D
To make sure that the status is in sync with the list of taskruns owned by the pipelinerun, we can run this check before we try and use the pipelinerun status as the source of truth.
kk sounds good to me!
Thanks for looking into this and opening the other issue 🙏
```diff
@@ -48,6 +48,7 @@ var (
 		stats.UnitDimensionless)
 )
 
+// Recorder holds keys for Tekton metrics
```
thanks for adding docs!
```diff
@@ -160,7 +160,8 @@ func (c *Reconciler) Reconcile(ctx context.Context, key string) error {
 	// to update, and return the original error combined with any update error
 	var merr error
 
-	if pr.IsDone() {
+	switch {
```
nice :D
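As a sketch, the reconcile entry then takes the shape below; only the overall structure is taken from the diff, the case bodies are paraphrased:

```go
switch {
case pr.IsDone():
	// Nothing left to reconcile: clean up and emit final events.
case pr.IsCancelled():
	// Handled early: patch the TaskRuns recorded in the status as cancelled,
	// without building the DAG or resolving resources.
default:
	// Normal path: resolve the pipeline, build the pipelineState, schedule tasks.
}
```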
/hold
When a pipeline is cancelled, detect it early in the reconcile cycle and cancel all taskruns defined in the pipelinerun status, so that we don't spend time building the DAG and resolving all resources. The patch to cancel a TaskRun is the same for all TaskRuns, so build it once and apply it to all taskruns.
Force-pushed from 69541df to 6f1a6ea.
/hold cancel
The following is the coverage report on the affected files.
Thanks for the in-depth back and forth :D 💃 /lgtm
Changes
When a pipeline is cancelled, detect it early in the reconcile
cycle and cancel all taskruns defined in the pipelinerun status,
so that we don't spend time building the DAG and resolving all
resources.
The patch to cancel a TaskRun is the same for all TaskRuns, so
build it once and apply it to all taskruns.
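A condensed sketch of that flow, reusing the pre-built patch bytes from the sketch earlier in the thread; the function name and the error aggregation are illustrative, not necessarily the exact code merged here:

```go
// Assumed imports: fmt, strings, k8s.io/apimachinery/pkg/types, and the
// generated Tekton clientset (clientset.Interface).
func cancelPipelineRun(clientSet clientset.Interface, pr *v1alpha1.PipelineRun) error {
	var errs []string
	// Loop over the TaskRuns in the PipelineRun status; a TaskRun that is
	// not in the status yet does not need to be cancelled.
	for taskRunName := range pr.Status.TaskRuns {
		// cancelPatch was marshalled once and is reused for every TaskRun.
		if _, err := clientSet.TektonV1alpha1().TaskRuns(pr.Namespace).Patch(
			taskRunName, types.JSONPatchType, cancelPatch, ""); err != nil {
			errs = append(errs, fmt.Sprintf("failed to patch TaskRun %q with cancellation: %v", taskRunName, err))
		}
	}
	if len(errs) > 0 {
		return fmt.Errorf("error(s) cancelling TaskRun(s): %s", strings.Join(errs, "; "))
	}
	return nil
}
```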
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
See the contribution guide for more details.
Double check this list of stuff that's easy to miss:
- If you've added a new binary to the cmd dir, please update the release Task to build and release this image.
Reviewer Notes
If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.