HDDS-2607 DeadNodeHandler should not remove replica for a dead maintenance node #343

sodonnel · 2019-12-11T15:56:35Z

What changes were proposed in this pull request?

Normally, when a node goes dead, the DeadNodeHandler removes all the containers and replica associated with the node from the ContainerManager.

If a node is IN_MAINTENANCE and goes dead, then we do not want to remove its replica. They should remain present in the system to prevent the container being marked as under-replicated.

We also need to consider the case where the node is dead, and then maintenance expires automatically. In that case, the replica associated with the node must be removed and the affected containers will become under-replicated.

This PR resolves this issue, by ensuring stale, dead and healthy node events are fired as normal by the NodeStateManager when the node health state changes. This is works as before, prior to any decommission related changes.

To ensure the the container replicas are not removed for a node IN_MAINTENANCE, the DeadNodeHandler has been modified to check the node state and skip removing the replicas if the node is IN_MAINTENANE. For all other states, the dead node handling is unchanged.

Then, for any changes in the node operational state, the same health events are fired when the operational state is changed. This means that a DECOMMISSIONED or IN_MAINTENANCE node that is heartbeating and healthy will fire the HEALTHY node event when it is moved to IN_SERVICE, which triggers pipeline creation.

Similarly, if a node was IN_MAINTENANCE and is DEAD, when it transitions to IN_SERVICE, the dead node event will be trigger, causing the container replicas to be removed. Some of these events will have no effect, but the volume of changes to the node operational state will be minimal, as it will only be changed by the decommission monitor or someone triggering or cancelling decommission and maintenance.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-2607

How was this patch tested?

Additional and modified unit tests

sodonnel · 2019-12-13T16:28:06Z

This change partly tested by modifying the single test in TestDeadNodeHandler(), however due to HDDS-2508 this test is Ignored (passes locally, failing under CI). Therefore I ran the test manually a few times and it always passed and the test should be re-enabled when the mentioned issue is resolved.

…Also ensure container replicas are not removed when an IN_MAINTENANCE node goes dead

…known issue tracked in HDDS-2508

swagle · 2019-12-16T18:06:53Z

hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/node/TestDeadNodeHandler.java

-import org.junit.Before;
-import org.junit.Ignore;
-import org.junit.Test;
+import org.junit.*;


Generally, explicit imports are preferred over *

Ah, caught out by Intellij again! I have fixed this, and also, hopefully fixed my Intellij settings to stop it doing this.

adoroszlai · 2019-12-17T11:41:27Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeStateManager.java

-
-    nodeOpStateSM.addTransition(
-        NodeOperationalState.IN_SERVICE, NodeOperationalState.DECOMMISSIONING,
-        NodeOperationStateEvent.START_DECOMMISSION);


Now enum NodeOperationStateEvent is unused. Can it be removed?

Correct, this can be removed. I have made this change.

adoroszlai · 2019-12-17T11:48:11Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DeadNodeHandler.java

-      removeContainerReplicas(datanodeDetails);
+      // Remove the container replicas associated with the dead node unless it
+      // is IN_MAINTENANCE
+      if (!nodeManager.getNodeStatus(datanodeDetails).isInMaintenance()) {


Is it possible to get this message for a node ENTERING_MAINTENANCE? If so, shouldn't it be ignored here as well?

If a node goes dead while "entering maintenance" then it should be handled as a dead node. The reason, is that we have not yet determine if all the containers on the host are sufficiently replicated. If it goes dead before that check has completed and the node has moved into IN_MAINTENANCE, then it must be handled as a dead node as normal.

This is the same for a node which is DECOMMISSIONING and not yet reached DECOMMISSIONED.

Unless the node has reached its end state (IN_MAINTENANCE or DECOMMISSIONED), if it goes dead the workflow is aborted and it is handled just like any other dead node.

adoroszlai · 2019-12-17T11:54:25Z

hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/node/TestNodeStateManager.java

+      assertEquals("NON_HEALTHY_TO_HEALTHY_NODE",
+          eventPublisher.getLastEvent().getName());


I think it would be safer to compare the event, not its name:

assertEquals(SCMEvents.NON_HEALTHY_TO_HEALTHY_NODE, eventPublisher.getLastEvent());

True. I was pretty much copying existing code when I added this test. I have made this change here and also applied it to the other occurrences in the class to keep in consistent.

adoroszlai

Thank @sodonnel for updating the patch.

adoroszlai · 2019-12-17T17:40:50Z

Thanks @sodonnel for the contribution and @swagle for the review.

sodonnel force-pushed the HDDS-2607-dead-handling branch from 6109a94 to ae23eff Compare December 13, 2019 12:26

sodonnel requested a review from elek December 13, 2019 16:39

S O'Donnell added 2 commits December 14, 2019 14:16

Ensure health events are fire when a node operational state changes. …

ce30484

…Also ensure container replicas are not removed when an IN_MAINTENANCE node goes dead

Re-add ignore tag on TestDeadNodeHandler.testOnMessage() which has a …

89d8427

…known issue tracked in HDDS-2508

sodonnel force-pushed the HDDS-2607-dead-handling branch from e9320c3 to 89d8427 Compare December 14, 2019 14:55

swagle reviewed Dec 16, 2019

View reviewed changes

adoroszlai reviewed Dec 17, 2019

View reviewed changes

Address review comments

2c131ce

adoroszlai approved these changes Dec 17, 2019

View reviewed changes

adoroszlai merged commit ec1df69 into apache:HDDS-1880-Decom Dec 17, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HDDS-2607 DeadNodeHandler should not remove replica for a dead maintenance node #343

HDDS-2607 DeadNodeHandler should not remove replica for a dead maintenance node #343

Uh oh!

sodonnel commented Dec 11, 2019 •

edited

Loading

Uh oh!

sodonnel commented Dec 13, 2019

Uh oh!

swagle Dec 16, 2019

Uh oh!

sodonnel Dec 17, 2019

Uh oh!

adoroszlai Dec 17, 2019

Uh oh!

sodonnel Dec 17, 2019

Uh oh!

adoroszlai Dec 17, 2019

Uh oh!

sodonnel Dec 17, 2019

Uh oh!

adoroszlai Dec 17, 2019

Uh oh!

sodonnel Dec 17, 2019

Uh oh!

adoroszlai left a comment

Uh oh!

adoroszlai commented Dec 17, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		assertEquals("NON_HEALTHY_TO_HEALTHY_NODE",
		eventPublisher.getLastEvent().getName());

HDDS-2607 DeadNodeHandler should not remove replica for a dead maintenance node #343

HDDS-2607 DeadNodeHandler should not remove replica for a dead maintenance node #343

Uh oh!

Conversation

sodonnel commented Dec 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

sodonnel commented Dec 13, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adoroszlai left a comment

Choose a reason for hiding this comment

Uh oh!

adoroszlai commented Dec 17, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sodonnel commented Dec 11, 2019 •

edited

Loading