Fix race condition when resuming an aborted run #5600

Draft
wants to merge 1 commit into master
Conversation

bentsherman (Member)

Close #5595

When a Nextflow run is aborted for any reason, Nextflow tries to kill all running jobs and record them in the cache as failed tasks, so that on a resumed run the new jobs will use new task directories.

However, if Nextflow is unable to complete this cleanup, for example because of an abrupt crash, the task cache might not be updated. On a resumed run, the new task will then be executed in the same directory as before:

Path resumeDir = null
boolean exists = false
try {
    // look up the previous execution of this task in the cache db
    final entry = session.cache.getTaskEntry(hash, this)
    resumeDir = entry ? FileHelper.asPath(entry.trace.getWorkDir()) : null
    // note: `exists` can only become true when a cache entry was found
    if( resumeDir )
        exists = resumeDir.exists()
    log.trace "[${safeTaskName(task)}] Cacheable folder=${resumeDir?.toUriString()} -- exists=$exists; try=$tries; shouldTryCache=$shouldTryCache; entry=$entry"
    final cached = shouldTryCache && exists && entry.trace.isCompleted() && checkCachedOutput(task.clone(), resumeDir, hash, entry)
    if( cached )
        break
}
catch (Throwable t) {
    log.warn1("[${safeTaskName(task)}] Unable to resume cached task -- See log file for details", causedBy: t)
}
// a new work directory is only forced (by bumping `tries`) when `exists` is true,
// which never happens when the cache entry is missing
if( exists ) {
    tries++
    continue
}

What I think happens here is:

  1. the cache entry is missing
  2. Nextflow doesn't check whether the task directory exists
  3. Nextflow proceeds to re-use the task directory for the new task

This can cause a race condition: if the begin and exit files from the previous job are still present, Nextflow could submit a new job, detect the stale begin/exit files, mark the new job as completed (as long as the required outputs are still present), and launch downstream tasks on truncated data while the job it just submitted is still running and writing those outputs.
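
For what it's worth, the missing check boils down to something like the sketch below. This is only an illustration of the idea, not the actual change in this PR: needsFreshWorkDir and candidateWorkDir are made-up names, and the real code would derive the candidate directory from the task hash.

import java.nio.file.Files
import java.nio.file.Path

// Sketch only: decide whether the task needs a brand-new work directory.
// Today the decision is effectively "cache entry found && its directory exists";
// the extra branch also covers a directory that is still on disk from the
// aborted run even though the cache entry is missing.
boolean needsFreshWorkDir(Object cacheEntry, Path resumeDir, Path candidateWorkDir) {
    if( cacheEntry && resumeDir )
        return Files.exists(resumeDir)                          // current behaviour
    return candidateWorkDir && Files.exists(candidateWorkDir)   // proposed extra check
}

// e.g. no cache entry, but a leftover directory exists on disk
def leftover = Files.createTempDirectory('stale-task-dir')
assert needsFreshWorkDir(null, null, leftover)

With a check like that in place, the existing exists branch would bump tries and the resumed task would get a fresh directory instead of silently re-using the stale one.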

I think this can only happen on grid executors because the GridTaskHandler checks the begin/exit files before polling the scheduler. The cloud executors poll their API first, and the local executor doesn't check these files at all.
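
To make that difference concrete, here is a simplified model of the grid-style check order. It is not the real GridTaskHandler code; looksCompleted and schedulerSaysRunning are invented names, and the scheduler poll is reduced to a callback.

import java.nio.file.Files
import java.nio.file.Path

// Simplified model of a grid-style completion check: a leftover .exitcode is
// trusted before the scheduler is ever consulted, so a file written by the
// previous, aborted job makes the new job look completed immediately.
boolean looksCompleted(Path workDir, Closure<Boolean> schedulerSaysRunning) {
    final exitFile = workDir.resolve('.exitcode')
    if( Files.exists(exitFile) )       // may have been written by the previous job
        return true
    return !schedulerSaysRunning()     // only reached when no exit file is present
}

A handler that asks its scheduler API first, or checks the process handle directly, would never take the stale-file branch, which matches the behaviour described above.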

It should be possible to replicate this issue on a grid executor by cancelling a run with -disable-jobs-cancellation and allowing the jobs to complete before resuming the run, which, incidentally, is an unsafe property of that CLI option.

I'm trying to modify the local executor to simulate the issue, but I haven't been able to replicate it yet.
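
Purely as an illustration of what simulating the issue could involve (none of this is code from this PR, and it reuses looksCompleted from the sketch above), the setup comes down to planting the files an aborted job would leave behind and checking what a grid-style handler would conclude:

import java.nio.file.Files
import java.nio.file.Path

// Sketch of a possible simulation: fabricate the files an aborted job would
// leave behind, then ask a grid-style check (looksCompleted above) what it
// thinks, while pretending the scheduler still reports the job as running.
Path workDir = Files.createTempDirectory('stale-task-dir')
Files.write(workDir.resolve('.command.begin'), new byte[0])   // previous job "started"
Files.write(workDir.resolve('.exitcode'), '0'.bytes)          // ...and "exited" cleanly

assert looksCompleted(workDir) { true }   // reported as completed even though the
                                          // (simulated) scheduler says it is running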

netlify bot commented Dec 11, 2024

Deploy Preview for nextflow-docs-staging ready!

Latest commit: 544be43
Latest deploy log: https://app.netlify.com/sites/nextflow-docs-staging/deploys/675a0eb51e62630007303948
Deploy Preview: https://deploy-preview-5600--nextflow-docs-staging.netlify.app

Successfully merging this pull request may close these issues:

Resume from crashed head job caused process and its downstream dependency to run at the same time