Contributor

@amoghRZP amoghRZP commented Aug 21, 2020

In #52680 we introduced a mechanism that will allow nodes to remove
themselves from the cluster if they locally determine themselves to be
unhealthy. The only check today is that their data paths are all
empirically writeable. This commit extends this check to consider a
failure of NodeEnvironment#assertEnvIsLocked() to be an indication of
unhealthiness.

Closes #58373
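
As a rough illustration of the behaviour described above, the sketch below shows one way the reported status could be derived once a broken node lock counts as unhealthy. `HealthStatusSketch`, its nested `Status` and `StatusInfo` types, and the messages for the two path-based cases are assumptions made for illustration; only the "health check disabled" and "broken node lock" messages appear in this review, and this is not the actual `FsHealthService` code.

```java
import java.nio.file.Path;
import java.util.Set;

// Illustrative stand-in only; not the real FsHealthService.
class HealthStatusSketch {
    enum Status { HEALTHY, UNHEALTHY }

    // Minimal stand-in for the StatusInfo type that appears in the review excerpts.
    static final class StatusInfo {
        final Status status;
        final String info;

        StatusInfo(Status status, String info) {
            this.status = status;
            this.info = info;
        }
    }

    // unhealthyPaths == null is assumed to mean "every data path passed the write check".
    static StatusInfo getHealth(boolean enabled, boolean brokenLock, Set<Path> unhealthyPaths) {
        if (enabled == false) {
            return new StatusInfo(Status.HEALTHY, "health check disabled");
        } else if (brokenLock) {
            // A failed lock assertion takes precedence over the per-path results.
            return new StatusInfo(Status.UNHEALTHY, "health check failed due to broken node lock");
        } else if (unhealthyPaths == null) {
            return new StatusInfo(Status.HEALTHY, "health check passed"); // assumed wording
        } else {
            return new StatusInfo(Status.UNHEALTHY, "health check failed on " + unhealthyPaths); // assumed wording
        }
    }
}
```

Checking the lock flag before the path set means a node with a broken lock reports itself unhealthy even if its last write probes succeeded.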

Contributor

@DaveCTurner DaveCTurner left a comment

Thanks @amoghRZP, this is good work. I requested a few small changes but nothing fundamental.

}
currentUnhealthyPaths.add(path);
}
} catch (IllegalStateException e) {
Contributor

I'd prefer the try{} block only to contain the call to nodeEnv.nodeDataPaths(), would you reduce its scope? That way you don't need a local lockAssertionFailed, you can set brokenLock and exit immediately.
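
A minimal sketch of the narrower `try` scope suggested here, under stated assumptions: `NarrowTryScopeSketch` is a hypothetical class, and the `Supplier` and `Predicate` fields stand in for `nodeEnv.nodeDataPaths()` and the per-path write probe. Resetting `brokenLock` after a successful check is deliberately left out to keep the focus on the `try` block.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Illustrative stand-in only; not the real FsHealthService.FsHealthMonitor.
class NarrowTryScopeSketch {
    boolean brokenLock;
    List<Path> unhealthyPaths = new ArrayList<>();

    // Hypothetical stand-in for nodeEnv.nodeDataPaths(); throws
    // IllegalStateException when the node lock assertion fails.
    private final Supplier<Path[]> nodeDataPaths;
    // Hypothetical stand-in for the per-path write probe.
    private final Predicate<Path> isWritable;

    NarrowTryScopeSketch(Supplier<Path[]> nodeDataPaths, Predicate<Path> isWritable) {
        this.nodeDataPaths = nodeDataPaths;
        this.isWritable = isWritable;
    }

    void monitorFSHealth() {
        final Path[] paths;
        try {
            // Only the call that can fail the lock assertion is inside the try block.
            paths = nodeDataPaths.get();
        } catch (IllegalStateException e) {
            // Set the field directly and exit; no local lockAssertionFailed flag is needed.
            brokenLock = true;
            return;
        }

        List<Path> current = new ArrayList<>();
        for (Path path : paths) {
            if (isWritable.test(path) == false) {
                current.add(path);
            }
        }
        unhealthyPaths = current;
    }
}
```

With the early return, an `IllegalStateException` thrown by the write probes themselves would no longer be mistaken for a lock failure.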

if (enabled == false) {
statusInfo = new StatusInfo(HEALTHY, "health check disabled");
} else if (brokenLock == true) {
statusInfo = new StatusInfo(UNHEALTHY, "health check failed on node due to broken locks");
Contributor

Minor wording nit: specify which lock was broken (and remove the redundant "on node"). I suggest this:

Suggested change
- statusInfo = new StatusInfo(UNHEALTHY, "health check failed on node due to broken locks");
+ statusInfo = new StatusInfo(UNHEALTHY, "health check failed due to broken node lock");

  assert pathPrefix != null : "must set pathPrefix before starting disruptions";
- if (path.toString().startsWith(pathPrefix) && path.toString().endsWith(".es_temp_file")) {
+ if (path.toString().startsWith(pathPrefix) && path.toString().
+ endsWith(FsHealthService.FsHealthMonitor.TEMP_FILE_NAME)) {
Contributor

👍

  assert pathPrefix != null : "must set pathPrefix before starting disruptions";
- if (path.toString().startsWith(pathPrefix) && path.toString().endsWith(".es_temp_file")) {
+ if (path.toString().startsWith(pathPrefix) && path.toString().
+ endsWith(FsHealthService.FsHealthMonitor.TEMP_FILE_NAME)) {
Contributor

👍

}
}

public void testFailsHealthOnMissingLockFile() throws IOException {
Contributor

Thorough tests 😄 However they're not really testing anything in the FsHealthService so much as testing the details of the implementation of the NativeFSLock. Let's just have one of these here, and maybe consider filling in any gaps in Lucene's TestNativeFSLockFactory separately.

Contributor Author

@amoghRZP amoghRZP Aug 24, 2020

Thanks. I am thinking of keeping two of them: one where an IOException is thrown and another for AlreadyClosedException.

Contributor

No, one is all we need here.

NodeEnvironmentTests would be the right place to verify that NodeEnvironment#assertEnvIsLocked throws an IllegalStateException in both of those cases. I think we don't do that today, but again that's a question for a separate PR.

Contributor Author

OK, got it.

Contributor Author

amoghRZP commented Aug 24, 2020

> Thanks @amoghRZP, this is good work. I requested a few small changes but nothing fundamental.

Thanks @DaveCTurner, I have made changes as suggested.

Contributor

@DaveCTurner DaveCTurner left a comment

One issue remains (and another minor wording change)

private void monitorFSHealth() {
Set<Path> currentUnhealthyPaths = null;
for (Path path : nodeEnv.nodeDataPaths()) {
brokenLock = false;
Contributor

Clearing this flag here may mean the node reports itself as healthy even though it hasn't actually passed this health check yet. I think we should clear this flag only after setting unhealthyPaths at the very bottom of this method.
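
The ordering concern can be sketched as follows, assuming a simplified monitor: `FlagClearingSketch` and its two functional fields are hypothetical stand-ins for the real monitor, `nodeEnv.nodeDataPaths()`, and the write probe. The point is only that `brokenLock` stays set while a new check is still running and is cleared as the last step.

```java
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Illustrative stand-in only; not the real FsHealthService.FsHealthMonitor.
class FlagClearingSketch {
    volatile boolean brokenLock;
    volatile Set<Path> unhealthyPaths;

    Supplier<Path[]> nodeDataPaths; // hypothetical stand-in for nodeEnv.nodeDataPaths()
    Predicate<Path> isWritable;     // hypothetical stand-in for the per-path write probe

    void monitorFSHealth() {
        final Path[] paths;
        try {
            paths = nodeDataPaths.get();
        } catch (IllegalStateException e) {
            brokenLock = true;
            return;
        }

        // Note: brokenLock is NOT cleared here. A reader polling the status
        // during the loop below still sees the result of the previous check.
        Set<Path> currentUnhealthyPaths = null;
        for (Path path : paths) {
            if (isWritable.test(path) == false) {
                if (currentUnhealthyPaths == null) {
                    currentUnhealthyPaths = new HashSet<>();
                }
                currentUnhealthyPaths.add(path);
            }
        }
        unhealthyPaths = currentUnhealthyPaths;

        // Cleared last, once this run has published its path results.
        brokenLock = false;
    }
}
```

If the flag were reset at the top of the method instead, a status read taken mid-check could combine a cleared flag with stale path results and report the node as healthy too early.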

Contributor Author

sure, done.

try {
paths = nodeEnv.nodeDataPaths();
} catch (IllegalStateException e) {
logger.error("Lock assertions failed due to", e);
Contributor

Minor wording nit:

Suggested change
- logger.error("Lock assertions failed due to", e);
+ logger.error("health check failed", e);

Contributor Author

changed 👍

Contributor Author

amoghRZP commented Sep 1, 2020

@DaveCTurner I have made the changes as suggested.

Contributor Author

amoghRZP commented Sep 8, 2020

@DaveCTurner please let me know if any further changes are required once you have had a chance to look at it.

@DaveCTurner
Contributor

@elasticmachine update branch

@DaveCTurner
Contributor

@elasticmachine ok to test

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

@DaveCTurner DaveCTurner changed the title from "Remove node from cluster when node locks are broken." to "Remove node from cluster when node locks are broken" Sep 22, 2020
@DaveCTurner DaveCTurner changed the title from "Remove node from cluster when node locks are broken" to "Remove node from cluster when node locks broken" Sep 22, 2020
@DaveCTurner DaveCTurner added the :Core/Infra/Resiliency, v7.10.0, and v8.0.0 labels Sep 22, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Resiliency)

@elasticmachine elasticmachine added the Team:Core/Infra label Sep 22, 2020
@DaveCTurner DaveCTurner merged commit 71d0958 into elastic:master Sep 22, 2020
DaveCTurner pushed a commit that referenced this pull request Sep 22, 2020
@amoghRZP amoghRZP deleted the broken_nl_handling branch September 22, 2020 09:28
Labels

:Core/Infra/Resiliency (Keep running when everything is ok. Die quickly if things go horribly wrong.), >enhancement, Team:Core/Infra (Meta label for core/infra team), v7.10.0, v8.0.0-alpha1
