Prevent panics on tokio runtime shutdown #832

Merged: vkgnosis merged 2 commits into main from shutdown-panic on Nov 28, 2022

Conversation

vkgnosis
Contributor

Some of our code spawns background tasks that are expected to run forever. We already panic the whole process if a task panics, so this usually holds, but it can still be violated when we shut down the tokio runtime by returning from main. This happens in the orderbook binary when we gracefully shut down after receiving a SIGTERM signal. (We want to shut down gracefully so that active API connections are completed instead of being interrupted.)

When this happens, code that expects a background task to run forever can observe it having exited and panic in response. For example, we have seen this in our logs at the

.expect("worker task unexpectedly dropped");

line. It does not always happen, but there is a chance it could. The panic is benign because we're already shutting down the pod, but it is still annoying to get an alert about it.
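To make the failure mode concrete, here is a minimal hypothetical sketch (illustrative names, not this repository's actual code) of the pattern: a background worker owns one end of a channel and is assumed to live forever, and callers unwrap the channel result with exactly this kind of expect.

```rust
use tokio::sync::{mpsc, oneshot};

// Background task that is assumed to run for the lifetime of the process.
async fn worker(mut requests: mpsc::Receiver<oneshot::Sender<u64>>) {
    let mut counter = 0;
    while let Some(response) = requests.recv().await {
        counter += 1;
        let _ = response.send(counter);
    }
}

async fn next_id(requests: &mpsc::Sender<oneshot::Sender<u64>>) -> u64 {
    let (tx, rx) = oneshot::channel();
    requests
        .send(tx)
        .await
        .expect("worker task unexpectedly dropped");
    // When the runtime is dropped, `worker` is dropped with it, the channel
    // closes, and one of these expects is where the benign but
    // alert-triggering panic fires.
    rx.await.expect("worker task unexpectedly dropped")
}
```

In normal operation the worker is spawned once at startup and callers use next_id from request handlers; the expects can only trip while the runtime is being torn down.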

One way to fix this is to identify all code that relies on spawned tasks living forever and change it so that it no longer does. This is the most correct fix and gives users of that code the most flexibility: imagine a use case where you manually create a tokio runtime, use such a task inside of it, shut down the runtime, and then continue with the rest of your program. This is the only completely correct fix.
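Continuing the hypothetical sketch above, the completely correct version would make every caller treat a missing worker as a normal outcome rather than a bug, which is the extra complexity the next paragraph argues against:

```rust
use tokio::sync::{mpsc, oneshot};

// Every call site now has to handle `None`, even though in practice the
// worker only disappears while the runtime is shutting down.
async fn next_id(requests: &mpsc::Sender<oneshot::Sender<u64>>) -> Option<u64> {
    let (tx, rx) = oneshot::channel();
    // A closed channel now means "the runtime is shutting down", not a bug.
    requests.send(tx).await.ok()?;
    rx.await.ok()
}
```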

I decided against this because it makes that code more complicated and tedious to write in order to fix a rare and mostly benign case. It could also hide issues where we exit a task accidentally. It is preferable to assume that tasks really live forever.

In this PR I fix the issue by calling std::process::exit where we used to return from main. Unlike returning from main, this does not drop the runtime, so spawned tasks never observe a shutdown and our assumption that they live forever holds. To make this clearer I've also changed some signatures to return ! and removed some JoinHandles from select statements where they were already guaranteed never to resolve because their return type was already !.
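As a rough sketch of the resulting pattern (illustrative names, not the real binaries): tasks that never return advertise it in their signature, their JoinHandles are not selected on, and the process exits without the runtime ever being dropped.

```rust
use std::time::Duration;

// A task that really runs forever; the `!` return type makes that explicit.
async fn background_maintenance() -> ! {
    loop {
        // periodic work ...
        tokio::time::sleep(Duration::from_secs(60)).await;
    }
}

#[tokio::main]
async fn main() {
    // Since the task's return type is `!`, its JoinHandle can never resolve
    // normally, so there is no point including it in the select below.
    tokio::spawn(background_maintenance());

    tokio::select! {
        _ = serve_api() => tracing::error!("API task exited unexpectedly"),
        _ = shutdown_signal() => tracing::info!("shutting down"),
    }

    // Exit instead of returning from main: the runtime is never dropped, so
    // tasks that assume they live forever never observe a shutdown.
    std::process::exit(0);
}

// Stand-ins for the real API server future and SIGTERM listener.
async fn serve_api() { /* ... */ }
async fn shutdown_signal() { /* ... */ }
```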

This can probably happen in e2e tests too, even though I've never observed it there. In e2e tests we shouldn't exit the process, so that's not a good fix there. Even so, this PR still makes sense in place of the "completely correct" fix mentioned above.

Test Plan

There is no reliable reproduction for this issue and it happens rarely, so we just have to check that it doesn't happen again.

vkgnosis requested a review from a team as a code owner on November 22, 2022 15:27
_ = shutdown_signal() => {
    tracing::info!("Gracefully shutting down API");
    shutdown_sender.send(()).expect("failed to send shutdown signal");
    match tokio::time::timeout(Duration::from_secs(10), serve_api).await {
        Ok(inner) => inner.expect("API failed during shutdown"),
        Err(_) => tracing::error!("API shutdown exceeded timeout"),
    }
    std::process::exit(0);
}
Contributor Author

This is the only observable change in this PR.

nlordell (Contributor) left a comment

Very nice 🕵️-work!

Code looks good. Just some inline questions about what naively looks to me like small functional changes in behaviour.

vkgnosis enabled auto-merge (squash) on November 28, 2022 10:40
vkgnosis merged commit bfc0700 into main on Nov 28, 2022
vkgnosis deleted the shutdown-panic branch on November 28, 2022 10:44
github-actions bot locked and limited conversation to collaborators on Nov 28, 2022