Wait until server stabilizes after failing health check #4139

mtrippled · 2022-04-07T02:44:54Z

Instead of abending, wait until server stabilizes after failing online delete health check.

High Level Overview of Change

Instead of stopping during failed health checks, pause until the process recovers and continue deletion.

Context of Change

The change is mainly in the function that performs periodic health checks for online delete.

Type of Change

Bug fix (non-breaking change which fixes an issue)

ximinez

I'm still running some slightly longer tests, but this looks good. I left a comment wondering if the default time should be increased, but that's your call.

ximinez · 2022-04-21T19:54:55Z

cfg/rippled-example.cfg

+#                           'age_threshold_seconds' old. If not, then continue
+#                           sleeping for this number of seconds and 
+#                           checking until healthy.
+#                           Default is 5.


What do you think about making the default larger? Say 30s? In my own experiments with a Windows build, it took the node several minutes to recover. If a node falls out of sync, I doubt most will be able to recover and be stable enough to run the rotation again within 5s.

I'm okay with 5s for this timeout. We don't necessarily expect the node to actually recover in 5s. This is the rate at which we poll whether we've recovered. This is a little less than once every ledger, which smells like a reasonable value.

Yeah, that's a good point. I don't think the polling is expensive, so I guess it doesn't hurt to leave it at 5s.
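To make the distinction in this thread concrete, here is a minimal sketch of a poll-until-healthy loop. It is not the code from this PR: `waitUntilHealthy`, `WaitResult`, and the injected `isHealthy`, `shouldStop`, and `sleepFor` callbacks are all hypothetical names chosen for illustration. The point it shows is that the configured wait is only the interval between checks, so a node that needs minutes to recover simply goes through more iterations.

```cpp
#include <chrono>
#include <functional>

// Hypothetical result type; not taken from the rippled sources.
struct WaitResult
{
    bool healthy;    // true if the health check eventually passed
    unsigned polls;  // number of times the health check was evaluated
};

// Polls isHealthy() until it passes or shouldStop() returns true, sleeping
// for recoveryWait after each failed check. The sleep is injected so the
// sketch can be exercised without real delays.
inline WaitResult
waitUntilHealthy(
    std::function<bool()> const& isHealthy,
    std::function<bool()> const& shouldStop,
    std::chrono::seconds recoveryWait,
    std::function<void(std::chrono::seconds)> const& sleepFor)
{
    unsigned polls = 0;
    while (!shouldStop())
    {
        ++polls;
        if (isHealthy())
            return {true, polls};
        // recoveryWait is the polling interval, not a recovery deadline.
        sleepFor(recoveryWait);
    }
    return {false, polls};
}
```

In a real server the `sleepFor` callback would presumably be something like `std::this_thread::sleep_for` or an interruptible wait, and `shouldStop` would reflect a shutdown request.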


scottschurr

👍 Approach looks reasonable and has the advantage of simplifying the code. Nicely done.

However, I find the name stopping() highly misleading for a method that may sleep and halt progress for an indefinite time. I'd like to suggest a rename there. It doesn't change any of the logic, but a better name could significantly improve readability. The best I came up with is handleHealth(). I'm sure there are better names, but I felt that, at least, handleHealth() improves readability. I also had handleHealth() return an enum instead of a bool, which I felt also improved readability.

Here's a commit that does the rename: scottschurr@945e690. Feel free to cherry-pick that or use it as an example for a yet-better renaming.

If we're in too big a hurry to get the code checked in, then the rename can be applied later.
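As a rough illustration of the readability argument (not the contents of the linked commit), the sketch below contrasts with a bare bool by making the call site spell out what the result means. Only the name handleHealth() comes from the comment above; the `HealthResult` enum, its enumerators, the `shutdownRequested` parameter, and `runSteps` are assumptions made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical enum; the enumerator names are illustrative only.
enum class HealthResult { keepGoing, stopping };

// Stand-in for the real health handler. In the PR's design it may block
// until the server is healthy again; here it only reports whether a
// shutdown was requested.
inline HealthResult
handleHealth(bool shutdownRequested)
{
    return shutdownRequested ? HealthResult::stopping
                             : HealthResult::keepGoing;
}

// Runs the steps in order and returns how many completed before a
// shutdown was signalled at index shutdownAtStep.
inline std::size_t
runSteps(std::vector<std::string> const& steps, std::size_t shutdownAtStep)
{
    std::size_t done = 0;
    for (std::size_t i = 0; i < steps.size(); ++i)
    {
        // The comparison states the intent; `if (stopping())` would not
        // hint that the call might have slept first.
        if (handleHealth(i == shutdownAtStep) == HealthResult::stopping)
            return done;
        ++done;
    }
    return done;
}
```

With a scoped enum, a caller cannot accidentally invert the meaning of true and false, which is the kind of misreading the rename is meant to prevent.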

scottschurr · 2022-05-02T23:57:04Z

I'm okay with 5s for this timeout. We don't necessarily expect the node to actually recover in 5s. This is the rate at which we poll whether we've recovered. This is a little less than once every ledger, which smells like a reasonable value.

mtrippled requested review from ximinez and scottschurr April 7, 2022 02:44

mtrippled force-pushed the continuing-delete branch from b2c9354 to f2143de Compare April 7, 2022 05:42

scottschurr assigned ximinez and scottschurr Apr 7, 2022

ximinez approved these changes Apr 21, 2022


Instead of abending, wait until server stabilizes after failing online delete health check. (cbdfe9e)

mtrippled force-pushed the continuing-delete branch from f2143de to cbdfe9e Compare May 2, 2022 22:54

scottschurr approved these changes May 3, 2022


mtrippled added the Passed label (Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.) May 4, 2022

This was referenced May 10, 2022

Propose 1.9.1-b1 #4158 (Closed)

Proposed 1.9.1-b1 #4161 (Merged)

intelliot changed the title ~~Instead of abending, wait until server stabilizes after~~ Wait until server stabilizes after failing health check May 10, 2022

manojsdoshi closed this in #4161 May 11, 2022
