Thread panics in SpawnedTask during shutdown. #12089

wiedld · 2024-08-20T19:41:42Z

Describe the bug

The existing unreachable assumes that no polling will occur after the runtime begins a shutdown. However, we found that while running datafusion in it's own runtime (own threadpool) we can actually hit this unreachable code -- when we start shutting down the executor and an internal poll still occurs.

We think that how we are executing our datafusion queries is not uncommon, and therefore the shutdown behavior should not be causing a thread panic in datafusion.

To Reproduce

we made a reproducer in this draft PR.

Expected behavior

Don't cause thread panics during shutdown.

Additional context

The state goal of SpawnedTask is Provides guarantees of aborting on Drop to keep it cancel-safe.

The API for SpawnedTask::join_unwind() is for returning the generic R with the result removed. Meaning, this API is called when we presume an error is not going to occur -- and panic if there is an error.

Whereas a join error due to task cancelling is a "possible" error. So do we:

(a) keep the current API and leave the thread panic?
(b) remove the join_unwind() API, since it's inherently not cancel-safe.
- there is an existing SpawnedTask::join() which returns Result<R> propagating the Error
(c) have a longer discussion of how we should be handling runtime shutdowns in datafusion.

The text was updated successfully, but these errors were encountered:

alamb · 2024-08-20T20:17:10Z

I suggest a fourth option:

ignore a cancelled task (with comments about why) as it can only happen when the runtime is shutting down

Calling unreachable doesn't do any good in this case because it simply causes some thread to panic but that panic will not propagate to the main thread (as it is happening during a runtime shutdown anyways)

@DDtKey and @devinjdangelo for your thoughts I believe you added the code in question as part of #9422

DDtKey · 2024-08-20T20:36:03Z

Actually, I think the logic existed before #9422. There is a mention about shutdown:

datafusion/datafusion/common-runtime/src/common.rs

Lines 69 to 72 in e6e1eb2

    
           // Cancellation may be caused by two reasons: 
        
           // 1. Abort is called, but since we consumed `self`, it's not our case (`JoinHandle` not accessible outside). 
        
           // 2. The runtime is shutting down. 
        
           // So we consider this branch as unreachable.

I believe it's just something we need to handle. SpawnedTask improved the general behavior and unified the usage, but this particular case wasn't covered.

In general, I always prefer fallible API, so we can consider using join instead and get rid of join_unwind usages - returning an appropriate error instead.

But ignoring the error will be much easier (with some debug logging at least) - and that may be more than enough

alamb · 2024-08-20T21:35:28Z

Maybe we can add a info! or debug! type message to assist in anyone else who may encounter the issue

devinjdangelo · 2024-08-21T00:28:49Z

Maybe we can add a info! or debug! type message to assist in anyone else who may encounter the issue

I think this makes sense. If for some reason a caller needs the exact error, they could call join instead.

crepererum · 2024-08-21T08:22:25Z

I totally see the use case for join_unwind though: it resumes unwinding on another thread, which often simplifies debugging a lot. Yeeting through logs and errors that just say "some thread panicked" without an actual backtrace isn't much fun (based on personal experience). So I think we should keep that API, but make it fallible:

return OK(...) if the underlying task returned cleanly
panic if the underlying task panicked
return Err(...) if the underlying task couldn't be completed due to system reasons (which in the current tokio setup is always a runtime shutdown)

wiedld · 2024-08-21T22:33:55Z

PR fix is ready for review.

wiedld added the bug Something isn't working label Aug 20, 2024

wiedld mentioned this issue Aug 21, 2024

Fix thread panic when "unreachable" SpawnedTask code is reachable. #12086

Merged

alamb closed this as completed in #12086 Aug 23, 2024

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Thread panics in SpawnedTask during shutdown. #12089

Thread panics in SpawnedTask during shutdown. #12089

wiedld commented Aug 20, 2024 •

edited

Loading

alamb commented Aug 20, 2024

DDtKey commented Aug 20, 2024 •

edited

Loading

alamb commented Aug 20, 2024

devinjdangelo commented Aug 21, 2024

crepererum commented Aug 21, 2024

wiedld commented Aug 21, 2024

Thread panics in SpawnedTask during shutdown. #12089

Thread panics in SpawnedTask during shutdown. #12089

Comments

wiedld commented Aug 20, 2024 • edited Loading

Describe the bug

To Reproduce

Expected behavior

Additional context

alamb commented Aug 20, 2024

DDtKey commented Aug 20, 2024 • edited Loading

alamb commented Aug 20, 2024

devinjdangelo commented Aug 21, 2024

crepererum commented Aug 21, 2024

wiedld commented Aug 21, 2024

wiedld commented Aug 20, 2024 •

edited

Loading

DDtKey commented Aug 20, 2024 •

edited

Loading