Prevent panics on tokio runtime shutdown #832

vkgnosis · 2022-11-22T15:27:12Z

Some of our code spawns background tasks that are expected to run forever. We already panic the whole process if a task panics, so this usually holds, but it can still be violated when we shut down the tokio runtime by returning from main. This happens in the orderbook binary when we gracefully shut down after receiving a SIGTERM signal. (We want to shut down gracefully so that active API connections are completed instead of being interrupted.)

When this happens, code that expects a background task to run forever can observe that it has exited and panic in response. For example, this has happened in

services/crates/shared/src/ethrpc/buffered.rs

Line 117 in 1b0c9af

.expect("worker task unexpectedly dropped");

log. It does not always happen, but there is a chance it could. The panic is benign because we're already shutting down the pod, but it is still annoying to get an alert about it.

One way to fix this is to identify all code that relies on spawned tasks living forever and change it so that it no longer does. This is the most correct fix and gives users of that code the most flexibility. Imagine a use case where you manually create a tokio runtime, use such a task inside of it, shut down the runtime, and then continue with the rest of your program. This is the only completely correct fix.
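A minimal sketch of what that more general fix could look like. It uses std threads and channels as stand-ins for tokio tasks and oneshot channels so that it has no dependencies; `Worker`, `WorkerStopped`, `spawn`, and `call` are hypothetical names and not the code in buffered.rs. The point is only that the caller gets an error back instead of hitting an `.expect` when the worker is gone.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical error returned when the background worker is gone.
#[derive(Debug, PartialEq)]
struct WorkerStopped;

/// A request carries its own reply channel, similar to a oneshot sender.
type Request = (u32, mpsc::Sender<u32>);

/// Hypothetical handle to a background worker that doubles numbers.
struct Worker {
    requests: mpsc::Sender<Request>,
}

impl Worker {
    /// Spawns a worker that handles at most `limit` requests and then exits,
    /// standing in for a task that is dropped during runtime shutdown.
    fn spawn(limit: usize) -> (Self, thread::JoinHandle<()>) {
        let (requests, inbox) = mpsc::channel::<Request>();
        let handle = thread::spawn(move || {
            for (value, reply) in inbox.iter().take(limit) {
                // The caller may have given up waiting; ignore a failed send.
                let _ = reply.send(value * 2);
            }
        });
        (Self { requests }, handle)
    }

    /// Returns an error instead of panicking when the worker has exited.
    fn call(&self, value: u32) -> Result<u32, WorkerStopped> {
        let (reply, response) = mpsc::channel();
        self.requests
            .send((value, reply))
            .map_err(|_| WorkerStopped)?;
        response.recv().map_err(|_| WorkerStopped)
    }
}

fn main() {
    let (worker, handle) = Worker::spawn(1);
    assert_eq!(worker.call(21), Ok(42));

    // Wait until the worker has exited, then observe the error path.
    handle.join().unwrap();
    assert_eq!(worker.call(1), Err(WorkerStopped));
}
```

Every caller of `call` now has to decide what to do with `WorkerStopped`, which is the extra complexity this approach brings.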

I decided against this because it makes that code more complicated and tedious to write in order to fix a rare and mostly benign case. It could also hide issues where we exit a task accidentally. It is preferable to assume that tasks really live forever.

In this PR I fix the issue by calling std::process::exit where we used to return from main. This does not shut down the runtime because nothing is dropped, so our assumption continues to hold. To make this clearer, I've also changed some signatures to return ! and removed some JoinHandles from select statements where they were already guaranteed not to resolve because their return type was already !.
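The difference between returning and exiting can be illustrated with a small std-only sketch. `FakeRuntime` is a hypothetical stand-in for the tokio runtime (dropping it is what would cancel the background tasks), and the function names are made up for this example. It relies on the documented behaviour of `std::process::exit`, which terminates the process without running destructors.

```rust
use std::process;
use std::sync::atomic::{AtomicBool, Ordering};

/// Set when the stand-in runtime is dropped.
static RUNTIME_DROPPED: AtomicBool = AtomicBool::new(false);

/// Hypothetical stand-in for a tokio runtime: dropping it is what would
/// cancel the spawned background tasks.
struct FakeRuntime;

impl Drop for FakeRuntime {
    fn drop(&mut self) {
        RUNTIME_DROPPED.store(true, Ordering::SeqCst);
        println!("runtime dropped, background tasks cancelled");
    }
}

/// Returning normally drops the runtime at the end of the scope.
fn shutdown_by_returning() {
    let _runtime = FakeRuntime;
}

/// The `!` return type documents that this function never returns.
/// `process::exit` does not run destructors, so `_runtime` is never dropped.
fn shutdown_by_exiting() -> ! {
    let _runtime = FakeRuntime;
    process::exit(0)
}

fn main() {
    shutdown_by_returning();
    assert!(RUNTIME_DROPPED.load(Ordering::SeqCst));

    RUNTIME_DROPPED.store(false, Ordering::SeqCst);
    // Exits with status 0; the drop message is not printed a second time.
    shutdown_by_exiting()
}
```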

This can probably happen in e2e tests too, even though I've never observed it. In e2e tests we shouldn't exit the process, so that's not a good fix there. Anyway, this PR still makes sense in lieu of the mentioned "completely correct" fix.

Test Plan

There is no reliable reproduction for this issue and it happens rarely, so we just have to check that it doesn't happen again.

vkgnosis · 2022-11-22T15:27:40Z

crates/orderbook/src/main.rs

        _ = shutdown_signal() => {
            tracing::info!("Gracefully shutting down API");
            shutdown_sender.send(()).expect("failed to send shutdown signal");
            match tokio::time::timeout(Duration::from_secs(10), serve_api).await {
                Ok(inner) => inner.expect("API failed during shutdown"),
                Err(_) => tracing::error!("API shutdown exceeded timeout"),
            }
+            std::process::exit(0);


This is the only observable change in this PR.

crates/autopilot/src/lib.rs

nlordell

Very nice 🕵️ -work!

Code looks good. Just some inline questions about what, naively, look to me like small functional changes in behaviour.

vkgnosis requested a review from a team as a code owner November 22, 2022 15:27

nlordell reviewed Nov 22, 2022

nlordell approved these changes Nov 22, 2022

Exit process before shutting down tokio runtime

239aed0

vkgnosis force-pushed the shutdown-panic branch from f4e45ab to 239aed0 November 22, 2022 15:52

MartinquaXD approved these changes Nov 28, 2022

vkgnosis enabled auto-merge (squash) November 28, 2022 10:40

Merge branch 'main' into shutdown-panic

37dab65

vkgnosis merged commit bfc0700 into main Nov 28, 2022

vkgnosis deleted the shutdown-panic branch November 28, 2022 10:44

github-actions bot locked and limited conversation to collaborators Nov 28, 2022
