Prevent panics on tokio runtime shutdown #832
Merged
Some of our code spawns background tasks that are expected to run forever. We already panic the whole process if a task panics, so this usually holds, but it can still be violated when we shut down the tokio runtime by returning from main. This happens in the orderbook binary when we gracefully shut down after receiving a SIGTERM signal. (We want to shut down gracefully so that active API connections are completed instead of being interrupted.)
When this happens, code that expects a background task to run forever can observe that the task has exited and panic in response. For example, this has happened in services/crates/shared/src/ethrpc/buffered.rs, line 117 in 1b0c9af.
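To illustrate the failure mode, here is a minimal sketch using only the standard library. It is an analogy, not the code in buffered.rs: a thread and `std::sync::mpsc` channels stand in for a tokio task, and `Doubler`, `stop_after`, `try_double`, and `double` are hypothetical names. The worker stops after a fixed number of requests, which stands in for the runtime shutting down underneath the caller.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

/// Hypothetical handle to a background worker that callers assume lives forever.
struct Doubler {
    requests: Sender<(u64, Sender<u64>)>,
}

impl Doubler {
    /// Spawns the worker. It serves `stop_after` requests and then exits,
    /// which stands in for the task being torn down on runtime shutdown.
    fn spawn(stop_after: usize) -> Self {
        let (requests, inbox) = mpsc::channel::<(u64, Sender<u64>)>();
        thread::spawn(move || {
            for (value, reply) in inbox.iter().take(stop_after) {
                // The caller may have given up waiting; that is not an error here.
                let _ = reply.send(value * 2);
            }
            // `inbox` is dropped when the closure ends, so later requests fail.
        });
        Self { requests }
    }

    /// Returns `None` if the worker is gone instead of panicking.
    fn try_double(&self, value: u64) -> Option<u64> {
        let (reply, response) = mpsc::channel();
        self.requests.send((value, reply)).ok()?;
        response.recv().ok()
    }

    /// Encodes the "task lives forever" assumption: panics if the worker exited.
    fn double(&self, value: u64) -> u64 {
        self.try_double(value)
            .expect("background task exited unexpectedly")
    }
}
```

Calling `double` after the worker has stopped panics, which is the kind of panic described above. `try_double` shows what the non-panicking alternative would look like: every caller has to handle the "worker is gone" case.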
One way to fix this is to identify all code that relies on spawned tasks living forever and change it to not do so. This is the only completely correct fix and gives users of that code the most flexibility. Imagine a use case where you manually create a tokio runtime, use such a task inside of it, shut down the runtime, and then continue with the rest of your program.
I decided against this because it makes that code more complicated and tedious to write in order to fix a rare and mostly benign case. It could also hide issues where we exit a task accidentally. It is preferable to assume that tasks really live forever.
In this PR I fix the issue by calling `std::process::exit` where we used to return from main. This does not shut down the runtime because nothing is dropped, so our assumption continues to hold. To make this clearer I've also changed some signatures to return `!` and removed some `JoinHandle`s from `select` statements where these were already guaranteed to not resolve because their return type was already `!`.
This can probably happen in e2e tests too, even though I've never observed it. In e2e tests we shouldn't exit the process, so that's not a good fix there. Anyway, this PR still makes sense in lieu of the mentioned "completely correct" fix.
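The two ideas in this fix, a task whose signature returns `!` and ending the process with `std::process::exit` instead of returning, can be sketched with the standard library alone. This is an analogy under stated assumptions rather than the code in this PR: a thread stands in for a tokio task, and `heartbeat`, `spawn_heartbeat`, and `shutdown` are hypothetical names.

```rust
use std::process;
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical background task. The `!` return type states that it never
/// returns, so callers have no completion to wait on or react to.
fn heartbeat(ticks: Sender<u64>) -> ! {
    let mut count = 0;
    loop {
        // A failed send only means nobody is listening; keep running.
        let _ = ticks.send(count);
        count += 1;
        thread::sleep(Duration::from_millis(1));
    }
}

/// Starts the task on its own thread and returns the receiving end.
fn spawn_heartbeat() -> Receiver<u64> {
    let (sender, receiver) = mpsc::channel();
    // The body has type `!`, which coerces to the declared `()` return type.
    thread::spawn(move || -> () { heartbeat(sender) });
    receiver
}

/// Ends the process in place of returning from `main`. `process::exit`
/// runs no destructors, so nothing owned by the caller is dropped first.
fn shutdown() -> ! {
    process::exit(0)
}
```

In a binary, `main` would call `shutdown()` once the graceful shutdown work is done, instead of falling off the end of the function. Because `heartbeat` returns `!`, there is nothing useful to wait for on its handle, which is the same reason such `JoinHandle`s can be dropped from `select` statements.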
Test Plan
There is no reliable reproduction for this issue and it happens rarely, so we just have to check that it doesn't happen again.