Retry service maintenance #811
Conversation
In general, I hate retry logic as it hides more problematic underlying issues... I'm not 100% convinced this is the right approach to take.
I think it's fine in this case because you are adding metrics so we should still be able to see that something is wrong.
How often does this happen? The obvious knob to turn for this is to make the alert condition more lenient. Why not do that? Whether we should retry here should depend on whether it makes sense for this functionality, not for the alert. If a maintenance run takes 1 second and the time between blocks is 12 seconds, then it can make sense to do it.
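To make the timing argument concrete, here is a minimal sketch of retrying within one block interval. All names and signatures here are hypothetical illustrations, not the PR's actual code:

```rust
use std::time::Duration;

// Hypothetical helper: retry a maintenance update a few times, sleeping
// between attempts. With a ~1 s maintenance run and ~12 s between blocks,
// a couple of retries fit comfortably before the next block arrives.
fn run_with_retries<F>(mut update: F, max_retries: u32, delay: Duration) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut last_err = String::new();
    for attempt in 0..=max_retries {
        match update() {
            Ok(()) => return Ok(()),
            Err(e) => {
                last_err = e;
                if attempt < max_retries {
                    std::thread::sleep(delay);
                }
            }
        }
    }
    Err(last_err)
}

fn main() {
    // Simulated node call: fails twice, then succeeds.
    let mut calls = 0;
    let result = run_with_retries(
        || {
            calls += 1;
            if calls < 3 {
                Err("node error".to_string())
            } else {
                Ok(())
            }
        },
        5,
        Duration::from_millis(1),
    );
    assert!(result.is_ok());
    assert_eq!(calls, 3);
    println!("succeeded after {calls} calls");
}
```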
Lg.
After merging we can drop #590
So, this is the reason I'm not 100% convinced this is the way to go (as specified in the description). Specifically, if we want maintenance to run so we are at the "latest" block always, then retrying makes sense. Otherwise, we stay 12 seconds knowing that there is a new block and we have stale data before trying to update again. So, conceptually, retrying here is not so much of a sin. However, my worry is that we are hiding node issues with this too much.
The issue is that the current alert measures how stale our data is (i.e. how far the last updated block is from the latest block). This PR proposes actively trying to make our data less stale (retrying to update to newer blocks when an update fails) and to start measuring how often that fails instead. Does this make sense?
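As a sketch of what "measuring how often that fails" could look like, a failure counter bumped on each failed update attempt (the counter name is made up; a real service would export this as a proper metric, e.g. via Prometheus, rather than a plain atomic):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical failure counter. Alerting would then key off the rate of
// failed update attempts instead of off data staleness.
static MAINTENANCE_UPDATE_FAILURES: AtomicU64 = AtomicU64::new(0);

fn record_update_result(result: &Result<(), String>) {
    if result.is_err() {
        MAINTENANCE_UPDATE_FAILURES.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    record_update_result(&Err("node timeout".to_string()));
    record_update_result(&Ok(()));
    // Only the failed attempt was recorded.
    println!(
        "failures: {}",
        MAINTENANCE_UPDATE_FAILURES.load(Ordering::Relaxed)
    );
}
```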
Force-pushed from 69c8a51 to fc318ea.
Yeah, that makes sense. In the description it sounded like the only reason for this change was to reduce alert noise.
Will update the PR description.
Just noticed that the …
In theory, their …
This PR changes the service maintenance component to retry updating to the latest block after some delay instead of waiting until a new block is observed.
This is done for two reasons:
Furthermore, regarding the first point, changing the parameters of the alert is not really possible. The issue is that the current alert measures how stale our data is (i.e. how far the last updated block is from the latest block). This PR proposes actively trying to make our data less stale (retrying to update to newer blocks when an update fails) and to start measuring how often that fails instead.
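The proposed behavior can be sketched roughly as follows. All names and signatures are hypothetical and simplified; the real component is more involved:

```rust
use std::time::Duration;

// Hypothetical sketch of the change: on a failed update, retry the same
// block after a short delay instead of keeping stale data until the next
// block is observed.
struct Maintenance {
    last_updated_block: u64,
}

impl Maintenance {
    // `update` stands in for the real maintenance work against the node.
    fn run_maintenance_with_retries(
        &mut self,
        latest_block: u64,
        update: &mut dyn FnMut(u64) -> Result<(), String>,
        retry_delay: Duration,
        max_retries: u32,
    ) {
        for attempt in 0..=max_retries {
            match update(latest_block) {
                Ok(()) => {
                    self.last_updated_block = latest_block;
                    return;
                }
                Err(_) if attempt < max_retries => std::thread::sleep(retry_delay),
                Err(_) => return, // give up until the next observed block
            }
        }
    }

    // Staleness is what the current alert measures: distance between the
    // latest block and the last block we managed to update to.
    fn staleness(&self, latest_block: u64) -> u64 {
        latest_block - self.last_updated_block
    }
}

fn main() {
    let mut maintenance = Maintenance { last_updated_block: 99 };
    let mut calls = 0;
    // Fake updater: the first attempt fails, the retry succeeds.
    let mut update = |_block: u64| -> Result<(), String> {
        calls += 1;
        if calls == 1 {
            Err("node error".to_string())
        } else {
            Ok(())
        }
    };
    maintenance.run_maintenance_with_retries(100, &mut update, Duration::from_millis(1), 2);
    // With the retry, data is fresh instead of staying one block behind.
    assert_eq!(maintenance.staleness(100), 0);
}
```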
Test Plan
Added a unit test.
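For context, a unit test of this behavior might look roughly like the following stand-alone sketch, using a fake updater that fails once and then succeeds. The PR's actual test and names will differ:

```rust
// Stand-alone sketch of a retry unit test; `fake_update` is a test double.
// Returns whether the update eventually succeeded and how many calls it took.
fn retry_until_ok(
    fake_update: &mut dyn FnMut() -> Result<(), String>,
    max_retries: u32,
) -> (bool, u32) {
    let mut calls = 0;
    for _ in 0..=max_retries {
        calls += 1;
        if fake_update().is_ok() {
            return (true, calls);
        }
    }
    (false, calls)
}

fn main() {
    // In the real code this would be a #[test] function.
    let mut failures_left = 1;
    let mut fake_update = || -> Result<(), String> {
        if failures_left > 0 {
            failures_left -= 1;
            Err("transient node error".to_string())
        } else {
            Ok(())
        }
    };
    let (ok, calls) = retry_until_ok(&mut fake_update, 3);
    assert!(ok, "update should eventually succeed");
    assert_eq!(calls, 2, "exactly one retry should be needed");
    println!("test passed");
}
```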