-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wait until server stabilizes after failing health check #4139
Conversation
b2c9354
to
f2143de
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still running some slightly longer tests, but this looks good. I left a comment wondering if the default time should be increased, but that's your call.
# 'age_threshold_seconds' old. If not, then continue | ||
# sleeping for this number of seconds and | ||
# checking until healthy. | ||
# Default is 5. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about making the default larger? Say 30s? In my own experiments with a Windows build, it took the node several minutes to recover. If a node falls out of sync, I doubt most will be able to recover and be stable enough to run the rotation again within 5s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with 5s
for this timeout. We don't necessarily expect the node to actually recover in 5s
. This is the rate at which we poll whether we've recovered. This is a little less than once every ledger, which smells like a reasonable value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good point. I don't think the polling is expensive, so I guess it doesn't hurt to leave it at 5s.
failing online delete healch check.
f2143de
to
cbdfe9e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Approach looks reasonable and has the advantage of simplifying the code. Nicely done.
However I find the name stopping()
for a method that that may sleep and halt progress for an indefinite time highly misleading when I'm reading the code. I'd like to suggests a rename there. It doesn't change any of the logic, but a better name could significantly improve readability. The best I came up with is handleHealth()
. I'm sure there are better names, but I felt that, at least, handleHealth()
improves readability. I also had handleHealth()
return an enum
instead of a bool
which I felt also improved readability.
Here's a commit that does the rename: scottschurr@945e690 Feel free to cherry-pick that or use it as an example for a yet-better renaming.
If we're in too big a hurry to get the code checked in, then the rename can be applied later.
# 'age_threshold_seconds' old. If not, then continue | ||
# sleeping for this number of seconds and | ||
# checking until healthy. | ||
# Default is 5. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with 5s
for this timeout. We don't necessarily expect the node to actually recover in 5s
. This is the rate at which we poll whether we've recovered. This is a little less than once every ledger, which smells like a reasonable value.
Instead of abending, wait until server stabilizes after failing online delete health check.
High Level Overview of Change
Instead of stopping during failed health checks, pause until the process recovers and continue deletion.
Context of Change
Mainly the function that performs periodic health checks for online delete.
Type of Change