HDDS-4343: ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly. #1495

GlenGeng-awx · 2020-10-14T08:51:42Z

What changes were proposed in this pull request?

ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly

      // If there are unhealthy replicas, then we should remove them even if it
      // makes the container violate the placement policy, as excess unhealthy
      // containers are not really useful. It will be corrected later as a
      // mis-replicated container will be seen as under-replicated.
      for (ContainerReplica r : unhealthyReplicas) {
        if (excess > 0) {
          sendDeleteCommand(container, r.getDatanodeDetails(), true);
          excess -= 1;
        }
        break;
      }
      // After removing all unhealthy replicas, if the container is still over
      // replicated then we need to check if it is already mis-replicated.
      // If it is, we do no harm by removing excess replicas. However, if it is
      // not mis-replicated, then we can only remove replicas if they don't
      // make the container become mis-replicated.

From the comment, it wants to remove all unhealthy replicas until excess reach 0 ? It should be

      for (ContainerReplica r : unhealthyReplicas) {
        if (excess > 0) {
          sendDeleteCommand(container, r.getDatanodeDetails(), true);
          excess -= 1;
        } else {
          break;
        }
      }

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4343#

How was this patch tested?

CI

…t handle unhealthyReplicas properly.

GlenGeng-awx · 2020-10-14T08:55:33Z

@sodonnel Hi Stephen, Please take a look at this PR. Thanks!

linyiqun · 2020-10-14T09:36:48Z

...p-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ReplicationManager.java

+        } else {
+          break;
        }
-        break;


Looking into this PR change, I am thinking if we still need to do excess check before sending delete command for unhealthy containers. As we want to remove all unhealthy container replicas, we can just send delete command and decrease the excess count by the way.

I prefer simplified the logic like below:

for (ContainerReplica r : unhealthyReplicas) { sendDeleteCommand(container, r.getDatanodeDetails(), true); excess -= 1; }

The unhealthyReplicas will also removed in ReplicationManager#handleUnstableContainer if we don't remove all them here.

@linyiqun What if there is an excess of containers, but all containers are somehow unhealthy? We don't want RM to remove all copies of them, as that could result in a lost container. That is why we limit the number of unhealthy containers removed to the excess number.

Okay, get it. Thanks @sodonnel.

sodonnel · 2020-10-14T10:27:46Z

@GlenGeng Thanks for highlighting this problem and well done for finding it. The change you have made looks good to me, +1 from my side.

linyiqun · 2020-10-14T10:55:26Z

...p-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ReplicationManager.java

+        } else {
+          break;
        }
-        break;


Okay, get it. Thanks @sodonnel.

* master: HDDS-4301. SCM CA certificate does not encode KeyUsage extension properly (apache#1468) HDDS-4158. Provide a class type for Java based configuration (apache#1407) HDDS-4297. Allow multiple transactions per container to be sent for deletion by SCM. HDDS-2922. Balance ratis leader distribution in datanodes (apache#1371) HDDS-4269. Ozone DataNode thinks a volume is failed if an unexpected file is in the HDDS root directory. (apache#1490) HDDS-4327. Potential resource leakage using BatchOperation. (apache#1493) HDDS-3995. Fix s3g met NPE exception while write file by multiPartUpload (apache#1499) HDDS-4343. ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly. (apache#1495)

HDDS-4343: ReplicationManager.handleOverReplicatedContainer() does no…

e73042e

…t handle unhealthyReplicas properly.

linyiqun reviewed Oct 14, 2020

View reviewed changes

sodonnel self-requested a review October 14, 2020 10:27

sodonnel approved these changes Oct 14, 2020

View reviewed changes

linyiqun approved these changes Oct 14, 2020

View reviewed changes

sodonnel merged commit 2650723 into apache:master Oct 14, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HDDS-4343: ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly. #1495

HDDS-4343: ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly. #1495

Uh oh!

GlenGeng-awx commented Oct 14, 2020 •

edited

Loading

Uh oh!

GlenGeng-awx commented Oct 14, 2020

Uh oh!

linyiqun Oct 14, 2020 •

edited

Loading

Uh oh!

sodonnel Oct 14, 2020

Uh oh!

linyiqun Oct 14, 2020

Uh oh!

sodonnel commented Oct 14, 2020

Uh oh!

linyiqun Oct 14, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

HDDS-4343: ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly. #1495

HDDS-4343: ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly. #1495

Uh oh!

Conversation

GlenGeng-awx commented Oct 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

GlenGeng-awx commented Oct 14, 2020

Uh oh!

linyiqun Oct 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sodonnel Oct 14, 2020

Choose a reason for hiding this comment

Uh oh!

linyiqun Oct 14, 2020

Choose a reason for hiding this comment

Uh oh!

sodonnel commented Oct 14, 2020

Uh oh!

linyiqun Oct 14, 2020

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

GlenGeng-awx commented Oct 14, 2020 •

edited

Loading

linyiqun Oct 14, 2020 •

edited

Loading