
Retry service maintenance #811

Merged — 3 commits from retry-maintenance merged into main on Nov 19, 2022
Conversation

@nlordell (Contributor) commented on Nov 17, 2022

This PR changes the service maintenance component to retry updating to the latest block after some delay instead of waiting until a new block is observed.

This is done for two reasons:

  1. To reduce alert noise caused by maintenance update delays (I made sure to keep track of successes and failures so we can set up dashboards and alerts for it).
  2. Because we want maintenance to run so that we are always at the "latest" block. Without retrying, we sit for 12 seconds knowing that there is a new block and that our data is stale before trying to update again. So, conceptually, retrying here is not so much of a sin.

Furthermore, regarding the first point, changing the parameters of the alerts is not really possible. The issue is that the current alert measures how stale our data is (i.e. how far the last updated block is from the latest block). This PR proposes actively trying to make our data less stale (retrying the update to newer blocks when it fails) and measuring how often that fails instead.
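For illustration, here is a minimal sketch of that behaviour (retry the same block after a short delay and record the outcome), assuming a tokio runtime; the function name, delay value, and error type are placeholders, not the crate's actual API:

```rust
use std::{future::Future, time::Duration};

/// Placeholder delay between retries; the real value is a tuning decision.
const RETRY_DELAY: Duration = Duration::from_secs(1);

/// Keep running `maintain` for the same block until it succeeds, instead of
/// waiting until the next block is observed.
async fn run_maintenance_with_retries<F, Fut, E>(block: u64, mut maintain: F)
where
    F: FnMut(u64) -> Fut,
    Fut: Future<Output = Result<(), E>>,
    E: std::fmt::Display,
{
    loop {
        match maintain(block).await {
            Ok(()) => {
                // Success: stop retrying. A real implementation would bump a
                // "success" counter here for dashboards and alerts.
                break;
            }
            Err(err) => {
                // Failure: record it, wait, and retry the same block rather
                // than sitting on stale data until the next block arrives.
                eprintln!("maintenance for block {block} failed: {err}; retrying");
                tokio::time::sleep(RETRY_DELAY).await;
            }
        }
    }
}
```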

In general, I hate retry logic, as it hides more problematic underlying issues... I'm not 100% convinced this is the right way to go.

Test Plan

Added a unit test.

@nlordell nlordell requested a review from a team as a code owner November 17, 2022 22:43
@MartinquaXD (Contributor) left a comment


In general, I hate retry logic, as it hides more problematic underlying issues... I'm not 100% convinced this is the right way to go.

I think it's fine in this case because you are adding metrics, so we should still be able to see that something is wrong.
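For example (an illustrative sketch only, not necessarily the metric added in this PR), a success/failure counter using the prometheus crate could look like this; the metric name and label are placeholders:

```rust
use prometheus::{IntCounterVec, Opts};

/// Counter for maintenance runs, labelled by outcome.
fn maintenance_runs_metric() -> IntCounterVec {
    IntCounterVec::new(
        Opts::new(
            "service_maintenance_runs",
            "Number of maintenance runs by outcome.",
        ),
        &["result"],
    )
    .expect("valid metric definition")
}

/// Record one run so dashboards and alerts can track the failure rate.
fn record_run(metric: &IntCounterVec, success: bool) {
    let outcome = if success { "success" } else { "failure" };
    metric.with_label_values(&[outcome]).inc();
}
```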

@vkgnosis (Contributor) commented on Nov 18, 2022

This is done in order to reduce alert noise because of maintenance update delays.

How often does this happen? The obvious knob to turn for this is to make the alert condition more lenient. Why not do that? Whether we should retry here should depend on whether it makes sense for this functionality, not the alert.

If a maintenance run takes 1 second and the time between blocks is 12 then it can make sense to do it.

@sunce86 (Contributor) left a comment


Lg.
After merging we can drop #590

@nlordell (Contributor, Author)

Whether we should retry here should depend on whether it makes sense for this functionality, not the alert.

So, this is the reason I'm not 100% convinced this is the way to go (as specified in the description).

Specifically, if we want maintenance to run so that we are always at the "latest" block, then retrying makes sense. Otherwise, we sit for 12 seconds knowing that there is a new block and that our data is stale before trying to update again. So, conceptually, retrying here is not so much of a sin.

However, my worry is that we are hiding node issues with this too much.

Why not do that?

The issue is that the current alert measures how stale our data is (i.e. how far the last updated block is from the latest block). This PR proposes actively trying to make our data less stale (retrying the update to newer blocks when it fails) and measuring how often that fails instead.

Does this make sense?

@vkgnosis (Contributor)

Yeah makes sense. In the description it sounded like the only reason for this change was to reduce alert noise.

@nlordell (Contributor, Author)

In the description it sounded like the only reason for this change was to reduce alert noise.

Will update the PR description.

@nlordell nlordell enabled auto-merge (squash) November 19, 2022 07:36
@nlordell nlordell merged commit 22ebd7a into main Nov 19, 2022
@nlordell nlordell deleted the retry-maintenance branch November 19, 2022 07:41
@github-actions github-actions bot locked and limited conversation to collaborators Nov 19, 2022
@MartinquaXD (Contributor)

Just noticed that the ServiceMaintenance can maintain multiple different update tasks. If only a single task in the list failed to update, all of the successful tasks would also get triggered again. I guess it would make sense to adapt the retry logic to only update the failed tasks until all tasks have run successfully for the given block.
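A rough sketch of that idea (hypothetical trait and names, assuming async_trait and tokio; not the crate's actual types): retry only the maintainers that failed for the block, dropping the ones that already succeeded.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the maintenance task interface.
#[async_trait::async_trait]
trait Maintaining: Send + Sync {
    async fn run_maintenance(&self, block: u64) -> Result<(), String>;
}

/// Run all maintainers for `block`, then keep retrying only the ones that
/// failed until every task has succeeded for that block.
async fn maintain_until_all_succeed(maintainers: &[Box<dyn Maintaining>], block: u64) {
    let mut pending: Vec<&dyn Maintaining> =
        maintainers.iter().map(|m| m.as_ref()).collect();
    while !pending.is_empty() {
        let mut still_failing = Vec::new();
        for maintainer in pending {
            // Successful tasks are dropped from the retry set; they are not
            // triggered again for this block.
            if maintainer.run_maintenance(block).await.is_err() {
                still_failing.push(maintainer);
            }
        }
        pending = still_failing;
        if !pending.is_empty() {
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    }
}
```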

@nlordell (Contributor, Author)

I guess it would make sense to adapt the retry logic to only update the failed tasks until all tasks ran successfully for the given block.

In theory, their updates should be no-ops (at least it's the case for the event updaters).
