Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread panics in SpawnedTask during shutdown. #12089

Closed
wiedld opened this issue Aug 20, 2024 · 6 comments · Fixed by #12086
Closed

Thread panics in SpawnedTask during shutdown. #12089

wiedld opened this issue Aug 20, 2024 · 6 comments · Fixed by #12086
Labels
bug Something isn't working

Comments

@wiedld
Copy link
Contributor

wiedld commented Aug 20, 2024

Describe the bug

The existing unreachable assumes that no polling will occur after the runtime begins a shutdown. However, we found that while running datafusion in it's own runtime (own threadpool) we can actually hit this unreachable code -- when we start shutting down the executor and an internal poll still occurs.

We think that how we are executing our datafusion queries is not uncommon, and therefore the shutdown behavior should not be causing a thread panic in datafusion.

To Reproduce

we made a reproducer in this draft PR.

Expected behavior

Don't cause thread panics during shutdown.

Additional context

The state goal of SpawnedTask is Provides guarantees of aborting on Drop to keep it cancel-safe.

The API for SpawnedTask::join_unwind() is for returning the generic R with the result removed. Meaning, this API is called when we presume an error is not going to occur -- and panic if there is an error.

Whereas a join error due to task cancelling is a "possible" error. So do we:

  • (a) keep the current API and leave the thread panic?
  • (b) remove the join_unwind() API, since it's inherently not cancel-safe.
    • there is an existing SpawnedTask::join() which returns Result<R> propagating the Error
  • (c) have a longer discussion of how we should be handling runtime shutdowns in datafusion.
@wiedld wiedld added the bug Something isn't working label Aug 20, 2024
@alamb
Copy link
Contributor

alamb commented Aug 20, 2024

I suggest a fourth option:

  1. ignore a cancelled task (with comments about why) as it can only happen when the runtime is shutting down

Calling unreachable doesn't do any good in this case because it simply causes some thread to panic but that panic will not propagate to the main thread (as it is happening during a runtime shutdown anyways)

@DDtKey and @devinjdangelo for your thoughts I believe you added the code in question as part of #9422

@DDtKey
Copy link
Contributor

DDtKey commented Aug 20, 2024

Actually, I think the logic existed before #9422. There is a mention about shutdown:

// Cancellation may be caused by two reasons:
// 1. Abort is called, but since we consumed `self`, it's not our case (`JoinHandle` not accessible outside).
// 2. The runtime is shutting down.
// So we consider this branch as unreachable.

I believe it's just something we need to handle. SpawnedTask improved the general behavior and unified the usage, but this particular case wasn't covered.

In general, I always prefer fallible API, so we can consider using join instead and get rid of join_unwind usages - returning an appropriate error instead.

But ignoring the error will be much easier (with some debug logging at least) - and that may be more than enough

@alamb
Copy link
Contributor

alamb commented Aug 20, 2024

Maybe we can add a info! or debug! type message to assist in anyone else who may encounter the issue

@devinjdangelo
Copy link
Contributor

Maybe we can add a info! or debug! type message to assist in anyone else who may encounter the issue

I think this makes sense. If for some reason a caller needs the exact error, they could call join instead.

@crepererum
Copy link
Contributor

I totally see the use case for join_unwind though: it resumes unwinding on another thread, which often simplifies debugging a lot. Yeeting through logs and errors that just say "some thread panicked" without an actual backtrace isn't much fun (based on personal experience). So I think we should keep that API, but make it fallible:

  • return OK(...) if the underlying task returned cleanly
  • panic if the underlying task panicked
  • return Err(...) if the underlying task couldn't be completed due to system reasons (which in the current tokio setup is always a runtime shutdown)

@wiedld
Copy link
Contributor Author

wiedld commented Aug 21, 2024

PR fix is ready for review.

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants