Skip to content

Conversation

@GlenGeng-awx
Copy link
Contributor

@GlenGeng-awx GlenGeng-awx commented Oct 14, 2020

What changes were proposed in this pull request?

ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly

      // If there are unhealthy replicas, then we should remove them even if it
      // makes the container violate the placement policy, as excess unhealthy
      // containers are not really useful. It will be corrected later as a
      // mis-replicated container will be seen as under-replicated.
      for (ContainerReplica r : unhealthyReplicas) {
        if (excess > 0) {
          sendDeleteCommand(container, r.getDatanodeDetails(), true);
          excess -= 1;
        }
        break;
      }
      // After removing all unhealthy replicas, if the container is still over
      // replicated then we need to check if it is already mis-replicated.
      // If it is, we do no harm by removing excess replicas. However, if it is
      // not mis-replicated, then we can only remove replicas if they don't
      // make the container become mis-replicated.

From the comment, it wants to remove all unhealthy replicas until excess reach 0 ? It should be

      for (ContainerReplica r : unhealthyReplicas) {
        if (excess > 0) {
          sendDeleteCommand(container, r.getDatanodeDetails(), true);
          excess -= 1;
        } else {
          break;
        }
      }

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4343#

How was this patch tested?

CI

@GlenGeng-awx
Copy link
Contributor Author

@sodonnel Hi Stephen, Please take a look at this PR. Thanks!

} else {
break;
}
break;
Copy link
Contributor

@linyiqun linyiqun Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking into this PR change, I am thinking if we still need to do excess check before sending delete command for unhealthy containers. As we want to remove all unhealthy container replicas, we can just send delete command and decrease the excess count by the way.

I prefer simplified the logic like below:

      for (ContainerReplica r : unhealthyReplicas) {
          sendDeleteCommand(container, r.getDatanodeDetails(), true);
          excess -= 1;
      }

The unhealthyReplicas will also removed in ReplicationManager#handleUnstableContainer if we don't remove all them here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@linyiqun What if there is an excess of containers, but all containers are somehow unhealthy? We don't want RM to remove all copies of them, as that could result in a lost container. That is why we limit the number of unhealthy containers removed to the excess number.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, get it. Thanks @sodonnel.

@sodonnel
Copy link
Contributor

@GlenGeng Thanks for highlighting this problem and well done for finding it. The change you have made looks good to me, +1 from my side.

@sodonnel sodonnel self-requested a review October 14, 2020 10:27
} else {
break;
}
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, get it. Thanks @sodonnel.

@sodonnel sodonnel merged commit 2650723 into apache:master Oct 14, 2020
errose28 added a commit to errose28/ozone that referenced this pull request Oct 19, 2020
* master:
  HDDS-4301. SCM CA certificate does not encode KeyUsage extension properly (apache#1468)
  HDDS-4158. Provide a class type for Java based configuration (apache#1407)
  HDDS-4297. Allow multiple transactions per container to be sent for deletion by SCM.
  HDDS-2922. Balance ratis leader distribution in datanodes (apache#1371)
  HDDS-4269. Ozone DataNode thinks a volume is failed if an unexpected file is in the HDDS root directory. (apache#1490)
  HDDS-4327. Potential resource leakage using BatchOperation. (apache#1493)
  HDDS-3995. Fix s3g met NPE exception while write file by multiPartUpload (apache#1499)
  HDDS-4343. ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly. (apache#1495)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants