HDDS-4343: ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly. #1495
@sodonnel Hi Stephen, please take a look at this PR. Thanks!
} else {
break;
}
break;
Looking at this change, I wonder whether we still need the excess check before sending delete commands for unhealthy replicas. Since we want to remove all unhealthy container replicas, we can just send the delete command and decrement the excess count as we go.
I'd prefer to simplify the logic as below:
for (ContainerReplica r : unhealthyReplicas) {
  sendDeleteCommand(container, r.getDatanodeDetails(), true);
  excess -= 1;
}
The unhealthyReplicas will also be removed in ReplicationManager#handleUnstableContainer if we don't remove all of them here.
@linyiqun What if there is an excess of containers, but all containers are somehow unhealthy? We don't want RM to remove all copies of them, as that could result in a lost container. That is why we limit the number of unhealthy containers removed to the excess number.
Okay, got it. Thanks @sodonnel.
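To illustrate the point made in this thread, here is a minimal, self-contained sketch of capping unhealthy-replica deletion at the excess count. It is not the ReplicationManager code from this PR: the `Replica` class, the `selectUnhealthyForDeletion` method, and the datanode names are hypothetical stand-ins, and the real code sends delete commands rather than returning a list.

```java
import java.util.ArrayList;
import java.util.List;

public class OverReplicationSketch {

  /** Hypothetical stand-in for a replica; not Ozone's ContainerReplica. */
  public static final class Replica {
    final String datanode;
    final boolean healthy;

    public Replica(String datanode, boolean healthy) {
      this.datanode = datanode;
      this.healthy = healthy;
    }
  }

  /**
   * Selects unhealthy replicas for deletion, stopping once `excess`
   * replicas have been chosen. Capping at the excess means that even if
   * every replica is unhealthy, the replication factor's worth of copies
   * is left in place.
   */
  public static List<Replica> selectUnhealthyForDeletion(
      List<Replica> replicas, int excess) {
    List<Replica> toDelete = new ArrayList<>();
    for (Replica r : replicas) {
      if (excess <= 0) {
        break; // nothing more may be removed
      }
      if (!r.healthy) {
        toDelete.add(r);
        excess -= 1;
      }
    }
    return toDelete;
  }

  public static void main(String[] args) {
    int replicationFactor = 3; // assumed value for the example
    // Four replicas, all unhealthy: the case raised in the review.
    List<Replica> replicas = List.of(
        new Replica("dn1", false), new Replica("dn2", false),
        new Replica("dn3", false), new Replica("dn4", false));
    int excess = replicas.size() - replicationFactor;
    // Only one replica is selected, so three copies remain.
    System.out.println(selectUnhealthyForDeletion(replicas, excess).size()); // prints 1
  }
}
```

With the uncapped loop proposed earlier in the thread, the same input would select all four replicas, which is the lost-container scenario described above.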
@GlenGeng Thanks for highlighting this problem and well done for finding it. The change you have made looks good to me, +1 from my side.
* master:
  HDDS-4301. SCM CA certificate does not encode KeyUsage extension properly (apache#1468)
  HDDS-4158. Provide a class type for Java based configuration (apache#1407)
  HDDS-4297. Allow multiple transactions per container to be sent for deletion by SCM.
  HDDS-2922. Balance ratis leader distribution in datanodes (apache#1371)
  HDDS-4269. Ozone DataNode thinks a volume is failed if an unexpected file is in the HDDS root directory. (apache#1490)
  HDDS-4327. Potential resource leakage using BatchOperation. (apache#1493)
  HDDS-3995. Fix s3g met NPE exception while write file by multiPartUpload (apache#1499)
  HDDS-4343. ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly. (apache#1495)
What changes were proposed in this pull request?
ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly
Judging from the code comment, the intent is to remove unhealthy replicas until excess reaches 0. It should be
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-4343#
How was this patch tested?
CI