-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Remove node from cluster when node locks broken #61400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
DaveCTurner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amoghRZP, this is good work. I requested a few small changes but nothing fundamental.
| } | ||
| currentUnhealthyPaths.add(path); | ||
| } | ||
| } catch (IllegalStateException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer the try{} block only to contain the call to nodeEnv.nodeDataPaths(), would you reduce its scope? That way you don't need a local lockAssertionFailed, you can set brokenLock and exit immediately.
| if (enabled == false) { | ||
| statusInfo = new StatusInfo(HEALTHY, "health check disabled"); | ||
| } else if (brokenLock == true) { | ||
| statusInfo = new StatusInfo(UNHEALTHY, "health check failed on node due to broken locks"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor wording nit: specify which lock was broken (and remove redundant on node), suggest this:
| statusInfo = new StatusInfo(UNHEALTHY, "health check failed on node due to broken locks"); | |
| statusInfo = new StatusInfo(UNHEALTHY, "health check failed due to broken node lock"); |
| assert pathPrefix != null : "must set pathPrefix before starting disruptions"; | ||
| if (path.toString().startsWith(pathPrefix) && path.toString().endsWith(".es_temp_file")) { | ||
| if (path.toString().startsWith(pathPrefix) && path.toString(). | ||
| endsWith(FsHealthService.FsHealthMonitor.TEMP_FILE_NAME)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| assert pathPrefix != null : "must set pathPrefix before starting disruptions"; | ||
| if (path.toString().startsWith(pathPrefix) && path.toString().endsWith(".es_temp_file")) { | ||
| if (path.toString().startsWith(pathPrefix) && path.toString(). | ||
| endsWith(FsHealthService.FsHealthMonitor.TEMP_FILE_NAME)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| } | ||
| } | ||
|
|
||
| public void testFailsHealthOnMissingLockFile() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thorough tests 😄 However they're not really testing anything in the FsHealthService so much as testing the details of the implementation of the NativeFSLock. Let's just have one of these here, and maybe consider filling in any gaps in Lucene's TestNativeFSLockFactory separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, i am thinking to keep two of them where one throws an IOException and another for AlreadyClosedException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, one is all we need here.
NodeEnvironmentTests would be the right place to verify that NodeEnvironment#assertEnvIsLocked throws an IllegalStateException in both of those cases. I think we don't do that today, but again that's a question for a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, Got it.
Thanks @DaveCTurner, I have made changes as suggested. |
DaveCTurner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue remains (and another minor wording change)
| private void monitorFSHealth() { | ||
| Set<Path> currentUnhealthyPaths = null; | ||
| for (Path path : nodeEnv.nodeDataPaths()) { | ||
| brokenLock = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clearing this flag here may mean the node reports itself as healthy even though it hasn't actually passed this health check yet. I think we should clear this flag only after setting unhealthyPaths at the very bottom of this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, done.
| try { | ||
| paths = nodeEnv.nodeDataPaths(); | ||
| } catch (IllegalStateException e) { | ||
| logger.error("Lock assertions failed due to", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor wording nit:
| logger.error("Lock assertions failed due to", e); | |
| logger.error("health check failed", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed 👍
|
@DaveCTurner i have made changes as suggested. |
|
@DaveCTurner let me know if any change etc is required, if you got chance to look at it. |
|
@elasticmachine update branch |
|
@elasticmachine ok to test |
DaveCTurner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Pinging @elastic/es-core-infra (:Core/Infra/Resiliency) |
In #52680 we introduced a mechanism that will allow nodes to remove themselves from the cluster if they locally determine themselves to be unhealthy. The only check today is that their data paths are all empirically writeable. This commit extends this check to consider a failure of `NodeEnvironment#assertEnvIsLocked()` to be an indication of unhealthiness. Closes #58373
In #52680 we introduced a mechanism that will allow nodes to remove
themselves from the cluster if they locally determine themselves to be
unhealthy. The only check today is that their data paths are all
empirically writeable. This commit extends this check to consider a
failure of
NodeEnvironment#assertEnvIsLocked()to be an indication ofunhealthiness.
Closes #58373