Fix race condition when resuming an aborted run #5600
Draft
Close #5595
When a Nextflow run is aborted for any reason, Nextflow will try to kill all jobs and save them to the cache as failed tasks, so that on a resumed run, the new jobs will use new task directories.
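For context, a minimal sketch of what that cleanup is meant to do (illustrative names only, not the actual TaskProcessor/cache API):

```groovy
import java.nio.file.Path

// Illustrative sketch only -- not the real TaskProcessor/cache API. The point
// is that every aborted task must be recorded as failed so that a resumed run
// computes a fresh work directory for it.
class AbortCleanupSketch {

    static void onAbort(List<Map> runningTasks, Map<String, Map> cacheDb) {
        for( task in runningTasks ) {
            killJob(task.jobId as String)           // ask the scheduler to kill the job
            cacheDb[task.hash as String] = [        // record the task as failed in the cache
                status : 'FAILED',
                workDir: task.workDir as Path
            ]
        }
        // if Nextflow crashes before this loop completes, some jobs are killed
        // (or left running) without ever being marked as failed in the cache
    }

    static void killJob(String jobId) {
        println "killing job $jobId"
    }
}
```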
However, if Nextflow is unable to complete this cleanup for any reason, such as an abrupt crash, the task cache might not be updated. On a resumed run, the new task will be re-executed in the same directory as before:
nextflow/modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy, lines 806 to 826 at b23e42c
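Roughly what I mean, as a simplified sketch (the names here are made up; the real logic goes through the task hash and cache lookup in the lines linked above):

```groovy
import java.nio.file.Path

// Simplified illustration of why the old directory comes back on resume:
// the work dir is derived from the task hash, and only an up-to-date cache
// entry marking the task as failed would push the resumed task into a new one.
class ResumeDirSketch {

    static Path resolveWorkDir(String taskHash, Map<String, Map> cacheDb, Path workRoot) {
        def entry = cacheDb[taskHash]
        if( entry?.status == 'FAILED' )
            // the recorded failure forces a fresh directory for the new attempt
            return workRoot.resolve(rehash(taskHash))
        // cache never updated (e.g. abrupt crash) -> the previous directory,
        // still holding the old begin/exit/output files, is reused as-is
        return workRoot.resolve(taskHash)
    }

    static String rehash(String oldHash) {
        UUID.randomUUID().toString()    // stand-in for the real re-hashing logic
    }
}
```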
What I think happens here is that this can cause a race condition: if the begin and exit files from the previous job are still present, Nextflow could submit a new job, detect the begin/exit files left by the previous job, mark the task as completed (as long as the required outputs are still present), and launch downstream tasks on truncated data.
I think this can only happen on grid executors because the GridTaskHandler checks the begin/exit files before polling the scheduler. The cloud executors poll their API first, and the local executor doesn't check these files at all.
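To illustrate the hazard, here is a hypothetical, simplified version of those checks (not the actual GridTaskHandler code; file and method names are only indicative):

```groovy
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical, simplified version of the begin/exit file checks -- not the
// actual GridTaskHandler code. If the aborted job left these files behind in
// the reused work dir, both checks can pass before the new job even starts.
class StaleFileCheckSketch {

    static boolean looksRunning(Path workDir) {
        // a leftover begin file makes the new job appear to have started
        Files.exists(workDir.resolve('.command.begin'))
    }

    static Integer readExitStatus(Path workDir) {
        def exitFile = workDir.resolve('.exitcode')
        if( !Files.exists(exitFile) )
            return null
        // this exit status may belong to the previous job, while the newly
        // submitted job is still (re)writing its outputs
        exitFile.text.trim() as Integer
    }
}
```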
It should be possible to replicate this issue on a grid executor by cancelling a run with `-disable-jobs-cancellation` and allowing the jobs to complete before resuming the run (which, by the way, is an unsafe property of that CLI option). I'm trying to modify the local executor to simulate the issue, but I haven't been able to replicate it yet.
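One way I might simulate it is to make the local handler honour a stale exit file the way the grid handler does; a rough sketch, with made-up names and hook points:

```groovy
import java.nio.file.Files
import java.nio.file.Path

// Rough sketch of a patched local task handler used only to reproduce the bug:
// report completion as soon as a stale exit file is visible in the work dir,
// instead of waiting for the locally spawned process to finish.
class SimulatedStaleHandlerSketch {

    Path workDir
    Process process     // the process started for the new task attempt

    boolean checkIfCompleted() {
        if( Files.exists(workDir.resolve('.exitcode')) )
            return true                 // stale file from the aborted run wins the race
        return !process.isAlive()       // normal local-executor behaviour
    }
}
```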