Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait until server stabilizes after failing health check #4139

Closed
wants to merge 1 commit into from

Conversation

mtrippled
Copy link
Collaborator

@mtrippled mtrippled commented Apr 7, 2022

Instead of abending, wait until server stabilizes after failing online delete health check.

High Level Overview of Change

Instead of stopping during failed health checks, pause until the process recovers and continue deletion.

Context of Change

Mainly the function that performs periodic health checks for online delete.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Copy link
Collaborator

@ximinez ximinez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still running some slightly longer tests, but this looks good. I left a comment wondering if the default time should be increased, but that's your call.

# 'age_threshold_seconds' old. If not, then continue
# sleeping for this number of seconds and
# checking until healthy.
# Default is 5.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about making the default larger? Say 30s? In my own experiments with a Windows build, it took the node several minutes to recover. If a node falls out of sync, I doubt most will be able to recover and be stable enough to run the rotation again within 5s.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with 5s for this timeout. We don't necessarily expect the node to actually recover in 5s. This is the rate at which we poll whether we've recovered. This is a little less than once every ledger, which smells like a reasonable value.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point. I don't think the polling is expensive, so I guess it doesn't hurt to leave it at 5s.

failing online delete healch check.
@mtrippled mtrippled force-pushed the continuing-delete branch from f2143de to cbdfe9e Compare May 2, 2022 22:54
Copy link
Collaborator

@scottschurr scottschurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Approach looks reasonable and has the advantage of simplifying the code. Nicely done.

However I find the name stopping() for a method that that may sleep and halt progress for an indefinite time highly misleading when I'm reading the code. I'd like to suggests a rename there. It doesn't change any of the logic, but a better name could significantly improve readability. The best I came up with is handleHealth(). I'm sure there are better names, but I felt that, at least, handleHealth() improves readability. I also had handleHealth() return an enum instead of a bool which I felt also improved readability.

Here's a commit that does the rename: scottschurr@945e690 Feel free to cherry-pick that or use it as an example for a yet-better renaming.

If we're in too big a hurry to get the code checked in, then the rename can be applied later.

# 'age_threshold_seconds' old. If not, then continue
# sleeping for this number of seconds and
# checking until healthy.
# Default is 5.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with 5s for this timeout. We don't necessarily expect the node to actually recover in 5s. This is the rate at which we poll whether we've recovered. This is a little less than once every ledger, which smells like a reasonable value.

@mtrippled mtrippled added the Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. label May 4, 2022
This was referenced May 10, 2022
@intelliot intelliot changed the title Instead of abending, wait until server stabilizes after Wait until server stabilizes after failing health check May 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants