Fix race condition when resuming an aborted run #5600
Draft
Close #5595
When a Nextflow run is aborted for any reason, Nextflow will try to kill all jobs and save them to the cache as failed tasks, so that on a resumed run, the new jobs will use new task directories.
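For context, a minimal sketch of what that cleanup is meant to do (illustrative names only, not the actual TaskProcessor/cache API):

```groovy
import java.nio.file.Path

// Illustrative sketch only -- not the real TaskProcessor/cache API. The point
// is that every aborted task must be recorded as failed so that a resumed run
// computes a fresh work directory for it.
class AbortCleanupSketch {

    static void onAbort(List<Map> runningTasks, Map<String, Map> cacheDb) {
        for( task in runningTasks ) {
            killJob(task.jobId as String)           // ask the scheduler to kill the job
            cacheDb[task.hash as String] = [        // record the task as failed in the cache
                status : 'FAILED',
                workDir: task.workDir as Path
            ]
        }
        // if Nextflow crashes before this loop completes, some jobs are killed
        // (or left running) without ever being marked as failed in the cache
    }

    static void killJob(String jobId) {
        println "killing job $jobId"
    }
}
```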
However, if Nextflow is unable to complete this cleanup for any reason, such as an abrupt crash, the task cache might not be updated. On a resumed run, the new task will be re-executed in the same directory as before:
nextflow/modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy, lines 806 to 826 at b23e42c
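Roughly what I mean, as a simplified sketch (the names here are made up; the real logic goes through the task hash and cache lookup in the lines linked above):

```groovy
import java.nio.file.Path

// Simplified illustration of why the old directory comes back on resume:
// the work dir is derived from the task hash, and only an up-to-date cache
// entry marking the task as failed would push the resumed task into a new one.
class ResumeDirSketch {

    static Path resolveWorkDir(String taskHash, Map<String, Map> cacheDb, Path workRoot) {
        def entry = cacheDb[taskHash]
        if( entry?.status == 'FAILED' )
            // the recorded failure forces a fresh directory for the new attempt
            return workRoot.resolve(rehash(taskHash))
        // cache never updated (e.g. abrupt crash) -> the previous directory,
        // still holding the old begin/exit/output files, is reused as-is
        return workRoot.resolve(taskHash)
    }

    static String rehash(String oldHash) {
        UUID.randomUUID().toString()    // stand-in for the real re-hashing logic
    }
}
```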
What I think happens here is that this can cause a race condition: if the begin and exit files from the previous job are still present, Nextflow could submit a new job, detect the begin/exit files left by the previous job, mark the task as completed (as long as the required outputs are still present), and launch downstream tasks on truncated data.
I think this can only happen on grid executors because the GridTaskHandler checks the begin/exit files before polling the scheduler. The cloud executors poll their API first, and the local executor doesn't check these files at all.
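To illustrate the hazard, here is a hypothetical, simplified version of those checks (not the actual GridTaskHandler code; file and method names are only indicative):

```groovy
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical, simplified version of the begin/exit file checks -- not the
// actual GridTaskHandler code. If the aborted job left these files behind in
// the reused work dir, both checks can pass before the new job even starts.
class StaleFileCheckSketch {

    static boolean looksRunning(Path workDir) {
        // a leftover begin file makes the new job appear to have started
        Files.exists(workDir.resolve('.command.begin'))
    }

    static Integer readExitStatus(Path workDir) {
        def exitFile = workDir.resolve('.exitcode')
        if( !Files.exists(exitFile) )
            return null
        // this exit status may belong to the previous job, while the newly
        // submitted job is still (re)writing its outputs
        exitFile.text.trim() as Integer
    }
}
```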
It should be possible to replicate this issue on a grid executor by cancelling a run with `-disable-jobs-cancellation` and allowing the jobs to complete before resuming the run (which, by the way, is an unsafe property of that CLI option). I'm trying to modify the local executor to simulate the issue, but I haven't been able to replicate it yet.
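One way I might simulate it is to make the local handler honour a stale exit file the way the grid handler does; a rough sketch, with made-up names and hook points:

```groovy
import java.nio.file.Files
import java.nio.file.Path

// Rough sketch of a patched local task handler used only to reproduce the bug:
// report completion as soon as a stale exit file is visible in the work dir,
// instead of waiting for the locally spawned process to finish.
class SimulatedStaleHandlerSketch {

    Path workDir
    Process process     // the process started for the new task attempt

    boolean checkIfCompleted() {
        if( Files.exists(workDir.resolve('.exitcode')) )
            return true                 // stale file from the aborted run wins the race
        return !process.isAlive()       // normal local-executor behaviour
    }
}
```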