
Retry service maintenance #811

Merged — 3 commits from retry-maintenance merged into main on Nov 19, 2022
Conversation

@nlordell (Contributor) commented on Nov 17, 2022

This PR changes the service maintenance component to retry updating to the latest block after some delay instead of waiting until a new block is observed.

This is done for two reasons:

  1. To reduce alert noise caused by maintenance update delays (I made sure to keep track of successes and failures so we can set up dashboards and alerts for it).
  2. Because we want maintenance to run so that we are always at the "latest" block. Without retrying, we sit for 12 seconds knowing that there is a new block and that our data is stale before trying to update again. So, conceptually, retrying here is not so much of a sin.

Furthermore, regarding the first point, changing the parameters of the alerts is not really possible. The issue is that the current alert measures how stale our data is (i.e. how far the last updated block is from the latest block). This PR proposes actively trying to make our data less stale (retrying the update to newer blocks when it fails) and measuring how often that fails instead.
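For illustration, here is a minimal sketch of that behaviour (retry the same block after a short delay and record the outcome), assuming a tokio runtime; the function name, delay value, and error type are placeholders, not the crate's actual API:

```rust
use std::{future::Future, time::Duration};

/// Placeholder delay between retries; the real value is a tuning decision.
const RETRY_DELAY: Duration = Duration::from_secs(1);

/// Keep running `maintain` for the same block until it succeeds, instead of
/// waiting until the next block is observed.
async fn run_maintenance_with_retries<F, Fut, E>(block: u64, mut maintain: F)
where
    F: FnMut(u64) -> Fut,
    Fut: Future<Output = Result<(), E>>,
    E: std::fmt::Display,
{
    loop {
        match maintain(block).await {
            Ok(()) => {
                // Success: stop retrying. A real implementation would bump a
                // "success" counter here for dashboards and alerts.
                break;
            }
            Err(err) => {
                // Failure: record it, wait, and retry the same block rather
                // than sitting on stale data until the next block arrives.
                eprintln!("maintenance for block {block} failed: {err}; retrying");
                tokio::time::sleep(RETRY_DELAY).await;
            }
        }
    }
}
```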

In general, I hate retry logic, as it hides more problematic underlying issues... I'm not 100% convinced this is the right way to go.

Test Plan

Added a unit test.

@nlordell nlordell requested a review from a team as a code owner November 17, 2022 22:43
@MartinquaXD (Contributor) left a comment


In general, I hate retry logic, as it hides more problematic underlying issues... I'm not 100% convinced this is the right way to go.

I think it's fine in this case because you are adding metrics, so we should still be able to see that something is wrong.
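For example (an illustrative sketch only, not necessarily the metric added in this PR), a success/failure counter using the prometheus crate could look like this; the metric name and label are placeholders:

```rust
use prometheus::{IntCounterVec, Opts};

/// Counter for maintenance runs, labelled by outcome.
fn maintenance_runs_metric() -> IntCounterVec {
    IntCounterVec::new(
        Opts::new(
            "service_maintenance_runs",
            "Number of maintenance runs by outcome.",
        ),
        &["result"],
    )
    .expect("valid metric definition")
}

/// Record one run so dashboards and alerts can track the failure rate.
fn record_run(metric: &IntCounterVec, success: bool) {
    let outcome = if success { "success" } else { "failure" };
    metric.with_label_values(&[outcome]).inc();
}
```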

@vkgnosis (Contributor) commented on Nov 18, 2022

This is done in order to reduce alert noise because of maintenance update delays.

How often does this happen? The obvious knob to turn for this is to make the alert condition more lenient. Why not do that? Whether we should retry here should depend on whether it makes sense for this functionality, not the alert.

If a maintenance run takes 1 second and the time between blocks is 12 then it can make sense to do it.

@sunce86 (Contributor) left a comment


Lg.
After merging we can drop #590

@nlordell (Contributor, Author)

Whether we should retry here should depend on whether it makes sense for this functionality, not the alert.

So, this is the reason I'm not 100% convinced this is the way to go (as specified in the description).

Specifically, if we want maintenance to run so that we are always at the "latest" block, then retrying makes sense. Otherwise, we sit for 12 seconds knowing that there is a new block and that our data is stale before trying to update again. So, conceptually, retrying here is not so much of a sin.

However, my worry is that we are hiding node issues with this too much.

Why not do that?

The issue is that the current alert measures how stale our data is (i.e. how far the last updated block is from the latest block). This PR proposes actively trying to make our data less stale (retrying the update to newer blocks when it fails) and measuring how often that fails instead.

Does this make sense?

@vkgnosis (Contributor)

Yeah makes sense. In the description it sounded like the only reason for this change was to reduce alert noise.

@nlordell (Contributor, Author)

In the description it sounded like the only reason for this change was to reduce alert noise.

Will update the PR description.

@nlordell nlordell enabled auto-merge (squash) November 19, 2022 07:36
@nlordell nlordell merged commit 22ebd7a into main Nov 19, 2022
@nlordell nlordell deleted the retry-maintenance branch November 19, 2022 07:41
@github-actions github-actions bot locked and limited conversation to collaborators Nov 19, 2022
@MartinquaXD (Contributor)

Just noticed that the ServiceMaintenance can maintain multiple different update tasks. If only a single task in the list failed to update, all of the successful tasks would also get triggered again. I guess it would make sense to adapt the retry logic to only update the failed tasks until all tasks have run successfully for the given block.
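A rough sketch of that idea (hypothetical trait and names, assuming async_trait and tokio; not the crate's actual types): retry only the maintainers that failed for the block, dropping the ones that already succeeded.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the maintenance task interface.
#[async_trait::async_trait]
trait Maintaining: Send + Sync {
    async fn run_maintenance(&self, block: u64) -> Result<(), String>;
}

/// Run all maintainers for `block`, then keep retrying only the ones that
/// failed until every task has succeeded for that block.
async fn maintain_until_all_succeed(maintainers: &[Box<dyn Maintaining>], block: u64) {
    let mut pending: Vec<&dyn Maintaining> =
        maintainers.iter().map(|m| m.as_ref()).collect();
    while !pending.is_empty() {
        let mut still_failing = Vec::new();
        for maintainer in pending {
            // Successful tasks are dropped from the retry set; they are not
            // triggered again for this block.
            if maintainer.run_maintenance(block).await.is_err() {
                still_failing.push(maintainer);
            }
        }
        pending = still_failing;
        if !pending.is_empty() {
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    }
}
```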

@nlordell (Contributor, Author)

I guess it would make sense to adapt the retry logic to only update the failed tasks until all tasks ran successfully for the given block.

In theory, their updates should be no-ops (at least it's the case for the event updaters).
