Skip to content

Conversation

@sodonnel
Copy link
Contributor

@sodonnel sodonnel commented Dec 11, 2019

What changes were proposed in this pull request?

Normally, when a node goes dead, the DeadNodeHandler removes all the containers and replica associated with the node from the ContainerManager.

If a node is IN_MAINTENANCE and goes dead, then we do not want to remove its replica. They should remain present in the system to prevent the container being marked as under-replicated.

We also need to consider the case where the node is dead, and then maintenance expires automatically. In that case, the replica associated with the node must be removed and the affected containers will become under-replicated.

This PR resolves this issue, by ensuring stale, dead and healthy node events are fired as normal by the NodeStateManager when the node health state changes. This is works as before, prior to any decommission related changes.

To ensure the the container replicas are not removed for a node IN_MAINTENANCE, the DeadNodeHandler has been modified to check the node state and skip removing the replicas if the node is IN_MAINTENANE. For all other states, the dead node handling is unchanged.

Then, for any changes in the node operational state, the same health events are fired when the operational state is changed. This means that a DECOMMISSIONED or IN_MAINTENANCE node that is heartbeating and healthy will fire the HEALTHY node event when it is moved to IN_SERVICE, which triggers pipeline creation.

Similarly, if a node was IN_MAINTENANCE and is DEAD, when it transitions to IN_SERVICE, the dead node event will be trigger, causing the container replicas to be removed. Some of these events will have no effect, but the volume of changes to the node operational state will be minimal, as it will only be changed by the decommission monitor or someone triggering or cancelling decommission and maintenance.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-2607

How was this patch tested?

Additional and modified unit tests

@sodonnel sodonnel force-pushed the HDDS-2607-dead-handling branch from 6109a94 to ae23eff Compare December 13, 2019 12:26
@sodonnel
Copy link
Contributor Author

This change partly tested by modifying the single test in TestDeadNodeHandler(), however due to HDDS-2508 this test is Ignored (passes locally, failing under CI). Therefore I ran the test manually a few times and it always passed and the test should be re-enabled when the mentioned issue is resolved.

@sodonnel sodonnel requested a review from elek December 13, 2019 16:39
S O'Donnell added 2 commits December 14, 2019 14:16
…Also ensure container replicas are not removed when an IN_MAINTENANCE node goes dead
@sodonnel sodonnel force-pushed the HDDS-2607-dead-handling branch from e9320c3 to 89d8427 Compare December 14, 2019 14:55
import org.junit.Before;
import org.junit.Ignore;
import org.junit.Test;
import org.junit.*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, explicit imports are preferred over *

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, caught out by Intellij again! I have fixed this, and also, hopefully fixed my Intellij settings to stop it doing this.


nodeOpStateSM.addTransition(
NodeOperationalState.IN_SERVICE, NodeOperationalState.DECOMMISSIONING,
NodeOperationStateEvent.START_DECOMMISSION);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now enum NodeOperationStateEvent is unused. Can it be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, this can be removed. I have made this change.

removeContainerReplicas(datanodeDetails);
// Remove the container replicas associated with the dead node unless it
// is IN_MAINTENANCE
if (!nodeManager.getNodeStatus(datanodeDetails).isInMaintenance()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to get this message for a node ENTERING_MAINTENANCE? If so, shouldn't it be ignored here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a node goes dead while "entering maintenance" then it should be handled as a dead node. The reason, is that we have not yet determine if all the containers on the host are sufficiently replicated. If it goes dead before that check has completed and the node has moved into IN_MAINTENANCE, then it must be handled as a dead node as normal.

This is the same for a node which is DECOMMISSIONING and not yet reached DECOMMISSIONED.

Unless the node has reached its end state (IN_MAINTENANCE or DECOMMISSIONED), if it goes dead the workflow is aborted and it is handled just like any other dead node.

Comment on lines 219 to 220
assertEquals("NON_HEALTHY_TO_HEALTHY_NODE",
eventPublisher.getLastEvent().getName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be safer to compare the event, not its name:

      assertEquals(SCMEvents.NON_HEALTHY_TO_HEALTHY_NODE,
          eventPublisher.getLastEvent());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I was pretty much copying existing code when I added this test. I have made this change here and also applied it to the other occurrences in the class to keep in consistent.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank @sodonnel for updating the patch.

@adoroszlai adoroszlai merged commit ec1df69 into apache:HDDS-1880-Decom Dec 17, 2019
@adoroszlai
Copy link
Contributor

Thanks @sodonnel for the contribution and @swagle for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants