Improve failure/recovery #422

Stebalien · 2024-07-07T12:33:44Z

Right now, various services can fail early and exit due to some internal error. We don't currently have any way to recover.

Ideally, at the top level, we'd listen for such failures and attempt to restart the failed service (possibly with a backoff).

Stebalien · 2024-07-30T18:47:12Z

We're going to need #392 for this to be completely correct, but there's no reason to block this work on it.

But we will likely need to figure out #368 (comment) first. That is, we need a way to detect if a sub-service has failed and either restart it (if possible) or exit so our parent can restart us.

E.g.:

If the certificate exchange fails, we can likely just restart that and leave everything else running.
If the GPBFT participant fails, we likely need to restart most of the F3 stack.

So, the first task here will be to gracefully handle early service aborts.

Kubuxu · 2024-10-09T13:27:01Z

Adding to M2.5 as a review and confirm item.

Stebalien self-assigned this Jul 7, 2024

Stebalien mentioned this issue Jul 11, 2024

Cleanup F3 API to use Run/Start and Stop instead of context for lifetime #368

Closed

Stebalien added the good first issue Good for newcomers label Jul 30, 2024

Stebalien removed their assignment Aug 30, 2024

Kubuxu added this to the Milestone 2.5: Mainnet Deployment Readiness milestone Oct 9, 2024
