HDDS-7058. EC: ReplicationManager - Implement ratis container replication check handler#3802
Conversation
swamirishi
left a comment
Some comments left inline. I am unsure about certain flows and a possible race condition; it would be great if you could clear up my doubts there. I haven't checked the unit tests yet.
Could be changed to getExcessRedundancyCanBeCalled(includePending)>0 to avoid redundancy of logic.
Good point. I have changed this.
Should we have another constructor which initializes the following arguments?
public UnderReplicatedHealthResult(ContainerInfo containerInfo,
    int remainingRedundancy, boolean dueToDecommission,
    boolean replicatedOkWithPending, boolean unrecoverable,
    boolean dueToMisReplication, boolean isMisReplicated,
    boolean isMisReplicatedAfterPending)
I don't want to change the existing constructor, as then we need to change it everywhere it is used. Adding a new constructor starts a bad pattern where each new parameter needs a new constructor, and what we really need is a builder.
At the moment I think these 3 parameters have good defaults for the common case and then using the settings when needed is a good compromise.
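To illustrate the compromise described here, a minimal sketch, assuming a simplified result class: the existing constructor keeps its signature, and the newer flags get defaults plus chainable setters. The class and setter names are hypothetical and are not the actual Ozone code.

```java
// Hypothetical, simplified stand-in for a health result class; not the actual Ozone class.
class UnderReplicatedResultSketch {
  private final int remainingRedundancy;
  private final boolean dueToDecommission;

  // Newer, optional flags default to false so existing callers are unaffected.
  private boolean dueToMisReplication = false;
  private boolean misReplicated = false;
  private boolean misReplicatedAfterPending = false;

  // The existing constructor signature stays unchanged.
  UnderReplicatedResultSketch(int remainingRedundancy, boolean dueToDecommission) {
    this.remainingRedundancy = remainingRedundancy;
    this.dueToDecommission = dueToDecommission;
  }

  // Setters return this so the uncommon cases can still be written in one expression.
  UnderReplicatedResultSketch setDueToMisReplication(boolean value) {
    this.dueToMisReplication = value;
    return this;
  }

  UnderReplicatedResultSketch setMisReplicated(boolean value) {
    this.misReplicated = value;
    return this;
  }

  UnderReplicatedResultSketch setMisReplicatedAfterPending(boolean value) {
    this.misReplicatedAfterPending = value;
    return this;
  }

  int getRemainingRedundancy() { return remainingRedundancy; }
  boolean isDueToDecommission() { return dueToDecommission; }
  boolean isDueToMisReplication() { return dueToMisReplication; }
  boolean isMisReplicated() { return misReplicated; }
  boolean isMisReplicatedAfterPending() { return misReplicatedAfterPending; }
}
```

A caller in the common case would only use the constructor; a mis-replication check would add `.setMisReplicated(true)` and similar calls as needed.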
We could add a boolean includePendingAdd parameter as well. I see duplicated logic in sufficientlyReplicated and isOverReplicated.
Yea the logic is very similar. I have added a new private method both can call.
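For readers following along, a rough sketch of what such a shared private method could look like. The class, fields, and the exact treatment of pending adds and deletes are assumptions made for illustration; the real replica count class in the PR may differ.

```java
// Hypothetical, simplified stand-in for a Ratis replica count; not the actual Ozone class.
class ReplicaCountSketch {
  private final int healthyReplicas;
  private final int pendingAdds;
  private final int pendingDeletes;
  private final int replicationFactor;

  ReplicaCountSketch(int healthyReplicas, int pendingAdds,
      int pendingDeletes, int replicationFactor) {
    this.healthyReplicas = healthyReplicas;
    this.pendingAdds = pendingAdds;
    this.pendingDeletes = pendingDeletes;
    this.replicationFactor = replicationFactor;
  }

  // Shared helper: positive means replicas beyond the replication factor,
  // negative means replicas are missing.
  private int redundancyDelta(boolean includePendingAdd,
      boolean includePendingDelete) {
    int available = healthyReplicas;
    if (includePendingAdd) {
      available += pendingAdds;
    }
    if (includePendingDelete) {
      available -= pendingDeletes;
    }
    return available - replicationFactor;
  }

  // Assumption: pending deletes always count against sufficiency.
  boolean isSufficientlyReplicated(boolean includePendingAdd) {
    return redundancyDelta(includePendingAdd, true) >= 0;
  }

  // Assumption: pending adds are ignored when checking for excess replicas.
  boolean isOverReplicated(boolean includePendingDelete) {
    return redundancyDelta(false, includePendingDelete) > 0;
  }
}
```

Both public checks become one-line comparisons against the same helper, so a later change to how pending operations are counted only has to be made in one place.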
...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java
Do we need to make this method public?
You could add @VisibleForTesting if you want to add unit tests, or even change the access specifier using reflection.
It will need to be public when we implement the under / over replication handler to process the under / over replicated queue. This is the same as in the ECHandler, where it has this same method public for that reason.
Suggested change:
- getPlacementStatus(replicas, requiredNodes, Collections.EMPTY_LIST);
+ getPlacementStatus(replicas, requiredNodes, Collections.emptyList());
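The reason for this suggestion is that `Collections.EMPTY_LIST` is a raw `List`, so passing it where a parameterized list is expected produces an unchecked conversion warning, while `Collections.emptyList()` infers the element type from the call site. A small sketch, with a hypothetical method standing in for `getPlacementStatus`:

```java
import java.util.Collections;
import java.util.List;

class EmptyListSketch {
  // Hypothetical stand-in for a method that takes a typed list of excluded nodes.
  static int countExcluded(List<String> excludedNodes) {
    return excludedNodes.size();
  }

  static int withTypedEmptyList() {
    // The type argument is inferred as String, so no unchecked warning is emitted.
    return countExcluded(Collections.emptyList());
  }

  @SuppressWarnings("unchecked")
  static int withRawEmptyList() {
    // Compiles, but only via an unchecked conversion from the raw List type.
    return countExcluded(Collections.EMPTY_LIST);
  }
}
```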
Can there be a case where there is a pending ADD and a pending DELETE for the same node? Some kind of race condition.
Better to use a Map<DataNodeDetails,Integer> in that case.
This should not be able to happen: if a node has a replica, it cannot get another copy of it, and for a delete to be scheduled the node must already have a copy, which prevents an add.
However, I will change this to two IF statements rather than if .. else if.
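A small sketch of why two independent IF statements are the safer form, assuming pending operations are tracked as per-node sets. The method and parameter names are hypothetical and only illustrate the control flow, not the handler's real code.

```java
import java.util.Set;

class PendingOpsSketch {
  // Net change in replica count expected on one datanode once pending operations complete.
  static int netPendingChange(String datanode,
      Set<String> pendingAddNodes, Set<String> pendingDeleteNodes) {
    int delta = 0;
    if (pendingAddNodes.contains(datanode)) {
      delta += 1;
    }
    // A separate IF, not "else if": if a node were ever in both sets,
    // the delete would still be counted instead of being silently skipped.
    if (pendingDeleteNodes.contains(datanode)) {
      delta -= 1;
    }
    return delta;
  }
}
```

With `else if`, a node present in both sets would be counted as a net add; with two IF statements the add and the delete cancel out.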
siddhantsangwan
left a comment
@sodonnel The handling logic looks good. I haven't checked the tests yet.
Are we deliberately not considering pending deletes here?
I think you are correct - I have missed this. We should be removing the inflight deletes as per the original method defined just above this one. I will fix this and modify a test to validate it.
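As an illustration of the fix being discussed, a minimal sketch that drops replicas with an inflight delete before the remaining replicas are evaluated. Datanodes are represented as plain strings and the method name is hypothetical; the actual change in the PR may be structured differently.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class InflightDeleteSketch {
  // Returns the replica nodes that will still hold a copy once pending deletes complete.
  static List<String> replicasExcludingPendingDeletes(
      List<String> replicaNodes, Set<String> pendingDeleteNodes) {
    return replicaNodes.stream()
        .filter(node -> !pendingDeleteNodes.contains(node))
        .collect(Collectors.toList());
  }
}
```

A test for this behaviour would set up three replicas with a pending delete on one of them and expect only the other two to be considered.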
NIT: Let's add a new line after this brace for better readability.
ok - I added that in.
19657b8 to
520c6b9
siddhantsangwan
left a comment
Changes look good! I just have 2 minor comments for the tests.
Since we're testing the HEALTHY case, let's change the test's name?
OK - I added IsHealthy to the end of it.
The replica index should be 0 instead of 2 I think. (IN_MAINTENANCE, 0)
yes, well spotted.
siddhantsangwan
left a comment
LGTM, pending green CI.
What changes were proposed in this pull request?
Create a handler for the new replication manager to process Ratis containers and detect under / over / mis-replication issues.
The logic is largely unchanged from the Legacy Replication Manager - simply packaged into the new "handler" structure.
At the moment, this code will not be executed by the new replication manager, as all non-EC containers will be directed to the Legacy Replication Manager for processing.
This Jira is part of the work to remove the Legacy Replication Manager.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-7058
How was this patch tested?
New unit tests