
feat: restart ETH witnessers on failure #2791

Merged: 21 commits merged into main on Jan 31, 2023

Conversation

kylezs (Contributor) commented Jan 30, 2023

On any failure this will restart the ETH witnessers. This includes creating a fresh ETH WS client, which should resolve most, if not all of the WS connection issues.

Can reuse a lot of code and concepts to do the same for DOT.

Most of the pain (read: fun) in this PR was borrow checker issues. The nice part about that is that it's a little easier to verify this works just by looking at the types, since it compiles :)

Out of scope

  • What to do if the witnesser checkpointing panics? (#2778) This can be done separately; this PR is already big enough.
  • Retry limit. Given how long-running this will be, I don't think a retry limit is strictly necessary, particularly for the witnessers. It may become useful for some other case, but for now I don't think it's needed.
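
In rough terms, the restart looks like the following minimal sketch; the client and witnesser functions here are illustrative stand-ins, not the actual crate code.

struct EthWsClient;

fn connect_ws_client() -> Result<EthWsClient, String> {
	// Imagine this dials the ETH node over websockets.
	Ok(EthWsClient)
}

fn run_eth_witnessers(_client: &EthWsClient) -> Result<(), String> {
	// Imagine this runs until the WS connection (or anything else) fails.
	Err("ws connection dropped".to_string())
}

fn main() {
	let mut restarts = 0u32;
	// On any failure, everything is dropped and started again with a fresh WS
	// client, so a broken connection is replaced rather than reused.
	while let Err(err) = connect_ws_client().and_then(|client| run_eth_witnessers(&client)) {
		restarts += 1;
		eprintln!("ETH witnessers failed ({err}); restarting (restart #{restarts})");
		if restarts >= 3 {
			break; // bounded here only so the sketch terminates; the PR itself has no retry limit
		}
	}
}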

kylezs force-pushed the fix/restart-witness branch from 2b59a91 to 72e2b53 on January 30, 2023 10:32
kylezs force-pushed the fix/restart-witness branch from 4209604 to 2303cae on January 30, 2023 13:18
kylezs force-pushed the fix/restart-witness branch 2 times, most recently from a2490b9 to 8a935e2 on January 30, 2023 14:23
@@ -61,7 +62,7 @@ where
 )> = None;

 loop {
-	let epoch_start = epoch_start_receiver.recv().await?;
+	let epoch_start = epoch_start_receiver.recv().await.expect("Sender closed");
kylezs (author) commented on the diff:
Not using expect here is quite annoying to solve, and given that panicking here shouldn't happen, and if it did it would be a symptom of another problem, I figure it's pretty reasonable to just panic.

kylezs force-pushed the fix/restart-witness branch from f2737d7 to 53141ae on January 30, 2023 15:21
kylezs force-pushed the fix/restart-witness branch from 53141ae to 5899ab3 on January 30, 2023 15:23
kylezs requested a review from dandanlen on January 30, 2023 15:24
kylezs marked this pull request as ready for review on January 30, 2023 15:24
kylezs requested a review from AlastairHolmes on January 30, 2023 15:26
dandanlen (Collaborator) left a comment:
Will take a closer look tomorrow.

}
}
assert_eq!(my_state, 5);
assert_eq!(counter, end_number - my_state);
dandanlen (Collaborator) commented:
I think this test is a bit convoluted and could be clarified.

For example, it's not clear what these assertions are testing or why these numbers matter. It may just be a naming thing: I'm guessing the number of restarts is significant here, so maybe something should be called restart_count and something else should be called iteration_count.

kylezs (author) replied Jan 31, 2023:
Simplified so that the counting isn't important (which it isn't); the test now just tracks the restarts and then the final success.

dandanlen (Collaborator) replied:
There are no more assertions - is this intentional?

kylezs (author) replied:
The assertions are implicit. We'd be adding assert_eq!(restart_count, 6) right after an if statement that already checks exactly that. And we know that branch is the one that returns, since we panic!() at the end otherwise.

The restart_count also cannot reach 6 unless it does restart, which requires hitting the error block 5 times, and there is only the one error case.

kylezs (author) added:
Effectively, the only way this test can hit the Ok() case is if the restart code works.

dandanlen (Collaborator) replied:
Ok - I made this point to Roy re. some of the swapping tests - we should try to keep tests as simple as they can be. In this case, if we want to test that it restarts, there's no need for the for loop counting to 10, and probably no need to restart six times either - if it restarts once, it will restart again.
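
Something like the following minimal sketch would cover it; restart_on_failure and the closure are illustrative stand-ins, not the repo's actual helpers.

fn restart_on_failure<S, F>(mut state: S, mut task: F) -> S
where
	F: FnMut(S) -> Result<S, S>,
{
	loop {
		match task(state) {
			Ok(final_state) => return final_state,
			// On failure the task hands its state back and is simply restarted.
			Err(recovered_state) => state = recovered_state,
		}
	}
}

#[test]
fn restarts_failed_task_until_it_succeeds() {
	let restarts = restart_on_failure(0u32, |restarts| {
		// Fail exactly once, handing the incremented counter back to the runner.
		if restarts == 0 {
			Err(restarts + 1)
		} else {
			Ok(restarts)
		}
	});
	// Reaching Ok at all proves the runner restarted the task after the failure.
	assert_eq!(restarts, 1);
}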

dandanlen (Collaborator) commented:

LGTM.

I feel like having to override the actual error with the state is unfortunate; it means we lose any error context. I still think the best option is to convince the compiler of the correct lifetimes so that we don't need to return the state at all, but I can see how this isn't possible currently.

What if we use a custom return wrapper like this:

pub struct TaskRecovery<S, E> {
	recovery_state: Option<S>,
	reason: E,
}

impl<S, E: std::error::Error> From<E> for TaskRecovery<S, E> {
	fn from(e: E) -> Self {
		TaskRecovery { recovery_state: None, reason: e }
	}
}

And then we can use map_err / ? instead of a custom macro; the error can bubble up and we handle/log it in the restart code.

This also allows us to have a (non-panic) way to return an unrecoverable error, as before.
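
For illustration, a rough sketch of how the wrapper could be used with ?, assuming the TaskRecovery above is in the same module; the witnesser state and task here are invented for the example, not taken from the crate.

#[derive(Debug)]
struct WitnesserState {
	last_block: u64,
}

fn process_block(_block: u64) -> Result<(), std::io::Error> {
	// Pretend this is the fallible RPC work.
	Ok(())
}

fn witness(mut state: WitnesserState) -> Result<(), TaskRecovery<WitnesserState, std::io::Error>> {
	while state.last_block < 10 {
		// `?` goes through the `From` impl above, so a plain error bubbles up
		// with `recovery_state: None` and only the reason is preserved.
		process_block(state.last_block)?;

		if state.last_block == 5 {
			// Where the state is known to be consistent, attach it explicitly
			// so the restart code can resume from it rather than starting over.
			return Err(TaskRecovery {
				recovery_state: Some(state),
				reason: std::io::Error::new(std::io::ErrorKind::Other, "simulated failure"),
			});
		}
		state.last_block += 1;
	}
	Ok(())
}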

kylezs (author) commented Jan 31, 2023:

I agree that some of the error handling can probably be improved, but figured it can wait, partly because I'm not certain of a good direction at the moment. We can probably generalise the task_scope a bit further to allow for some ideas like this too.

And then we can use map_err / ? instead of a custom macro,

In the cases where the macros are used, we're inside a future, so we need them regardless to get the state out of the future; otherwise the future owns that state and we can't access it from outside. Adding a .map_err() and returning the owned value means the closure necessarily captures that value, and the compiler isn't smart enough to know that we're returning directly afterwards anyway (via the ?), which is why the macro is required. Effectively the macro expands to a map_err in a lifetime/compiler-friendly way.
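
Roughly, the problem and the macro's shape look like the following sketch; the macro name and types are illustrative, not the actual macro in the repo.

macro_rules! try_or_return_state {
	($state:expr, $result:expr) => {
		match $result {
			Ok(ok) => ok,
			// The early return moves `$state` out of the future only on the
			// error path, without any closure ever capturing it.
			Err(_) => return Err($state),
		}
	};
}

fn fallible() -> Result<u32, ()> {
	Ok(1)
}

async fn task(state: Vec<u64>) -> Result<(), Vec<u64>> {
	// With `.map_err(|_| state)?` the closure would capture (move) `state`
	// unconditionally, so it could not be used again below, even though the
	// `?` returns immediately on the error path. The macro avoids that.
	let _value = try_or_return_state!(state, fallible());
	println!("state is still usable here, len = {}", state.len());
	Ok(())
}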

kylezs merged commit 127c7d6 into main on Jan 31, 2023
kylezs deleted the fix/restart-witness branch on January 31, 2023 15:36
kylezs linked an issue on Jan 31, 2023 that may be closed by this pull request
Successfully merging this pull request may close these issues.

Websocket RPC client should automatically reconnect if it becomes broken