HDDS-7654. EC: ReplicationManager - merge mis-rep queue into under replicated queue #4099

sodonnel · 2022-12-16T16:05:40Z

What changes were proposed in this pull request?

Mis Replication is another form of under replication, but with a lower priority. Therefore it would make sense that mis-replicated containers are queued in the same queue as under-replicated. Ideally, we would not spend resources fixing mis-replicated until all the under replicated containers are handled, so having them in separate queues doesn't add any value.

Mis-Replicated containers are queued with a priority lower than decommissioning containers in this PR, so they do not block decommission from completing if they are ahead of it in the queue.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7654

How was this patch tested?

Added a unit test to prove the mis-rep container queues with a lower priority than decommission. Small changes to existing tests cover most of the rest of the changes.

Ideally we should have a test in TestReplicationManager to cover the logic in ReplicationManager.processUnderReplicatedContainer but the way things are currently structured, that would need a bit of refactoring to be feasible, as ideally we should inject the under / over / mis-rep handler and then we can mock them to ensure the correct one is called. At the moment, they are created inside ReplicationManager so that is not possible.

…plicated queue

swamirishi · 2022-12-16T19:36:29Z

...cm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerHealthResult.java

+     * under / over replicated and decommissioning replicas in the under
+     * replication queue.
+     */
+    private static final int MIS_REP_REDUNDANCY = 6;


Instead of constant value, can we have some priority based on the misreplication count

I'm not really convinced that mis-rep count really means much and it is worth ordering them further, as the container already has enough replicas. We want these to start with a lower priority than anything else (under rep, decommission), which setting it to 6 does. After that, we can only lower the priority further. If we have a mis rep count of 1 or 2 for a 3-2 and a 6-3 container, which is higher priority? I'm not sure about a good way to do this.

swamirishi · 2022-12-16T19:38:45Z

...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java

    if (result.getContainerInfo().getReplicationType() == EC) {
-      return ecUnderReplicationHandler.processAndCreateCommands(replicas,
-          pendingOps, result, maintenanceRedundancy);
+      if (result.getHealthState()


Can we create a separate method to processMisreplicatedContainer & have this if case in UnderReplicatedProcessor class?

nit: probably making it into a switch case would be a good.

I actually think we should have a single processUnhealthyContainer method, and in it, have a switch / IF that caters for EC / Ratis and then Over / Under / Mis replicated. But we should also add some tests for that, to ensure the correct handler is called in the correct circumstances. However the way RM is currently we would need to change it to inject the handlers, so I think we should clean this up in another refactor jira.

siddhantsangwan

Looks good to me. @swamirishi we can merge this unless you have any other comments.

swamirishi

LGTM

kerneltime · 2022-12-19T17:29:30Z

Thank you @sodonnel !

* master: (88 commits) HDDS-7463. SCM Pipeline scrubber never able to cleanup allocated pipeline. (apache#4093) HDDS-7683. EC: ReplicationManager - UnderRep maintenance handler should not request nodes if none needed (apache#4109) HDDS-7635. Update failure metrics when allocate block fails in preExecute. (apache#4086) HDDS-7565. FSO purge directory for old bucket can update quota for new bucket (apache#4021) HDDS-7654. EC: ReplicationManager - merge mis-rep queue into under replicated queue (apache#4099) HDDS-7621. Update SCM term in datanode from heartbeat without any commands (apache#4101) HDDS-7649. S3 multipart upload EC release space quota wrong for old version (apache#4095) HDDS-7399. Enable specifying external root ca (apache#4053) HDDS-7398. Tool to remove old certs from the scm db (apache#3972) HDDS-6650. S3MultipartUpload support update bucket usedNamespace. (apache#4081) HDDS-7605. Improve logging in Container Balancer (apache#4067) HDDS-7616. EC: Refactor Unhealthy Replicated Processor (apache#4063) HDDS-7426. Add a new acceptance test for Streaming Pipeline. (apache#4019) HDDS-7478. [Ozone-Streaming] NPE in when creating a file with o3fs. (apache#3949) HDDS-7425. Add documentation for the new Streaming Pipeline feature. (apache#3913) HDDS-7438. [Ozone-Streaming] Add a createStreamKey method to OzoneBucket. (apache#3914) HDDS-7431. [Ozone-Streaming] Disable data steam by default. (apache#3900) HDDS-6955. [Ozone-streaming] Add explicit stream flag in ozone shell (apache#3559) HDDS-6867. [Ozone-Streaming] PutKeyHandler should not use streaming to put EC key. (apache#3516) HDDS-6842. [Ozone-Streaming] Reduce the number of watch requests in StreamCommitWatcher. (apache#3492) ...

…plicated queue (apache#4099) Co-authored-by: S O'Donnell <[email protected]> (cherry picked from commit ef48076) Change-Id: I2f3fea3e634e6e34ce63b1bd05fc06c0e75ff094

HDDS-7654. EC: ReplicationManager - merge mis-rep queue into under re…

be05320

…plicated queue

swamirishi requested changes Dec 16, 2022

View reviewed changes

siddhantsangwan approved these changes Dec 19, 2022

View reviewed changes

swamirishi approved these changes Dec 19, 2022

View reviewed changes

kerneltime merged commit ef48076 into apache:master Dec 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HDDS-7654. EC: ReplicationManager - merge mis-rep queue into under replicated queue #4099

HDDS-7654. EC: ReplicationManager - merge mis-rep queue into under replicated queue #4099

Uh oh!

sodonnel commented Dec 16, 2022

Uh oh!

swamirishi Dec 16, 2022

Uh oh!

sodonnel Dec 16, 2022

Uh oh!

swamirishi Dec 16, 2022

Uh oh!

swamirishi Dec 16, 2022

Uh oh!

sodonnel Dec 16, 2022

Uh oh!

siddhantsangwan left a comment

Uh oh!

swamirishi left a comment

Uh oh!

kerneltime commented Dec 19, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

HDDS-7654. EC: ReplicationManager - merge mis-rep queue into under replicated queue #4099

HDDS-7654. EC: ReplicationManager - merge mis-rep queue into under replicated queue #4099

Uh oh!

Conversation

sodonnel commented Dec 16, 2022

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

swamirishi Dec 16, 2022

Choose a reason for hiding this comment

Uh oh!

sodonnel Dec 16, 2022

Choose a reason for hiding this comment

Uh oh!

swamirishi Dec 16, 2022

Choose a reason for hiding this comment

Uh oh!

swamirishi Dec 16, 2022

Choose a reason for hiding this comment

Uh oh!

sodonnel Dec 16, 2022

Choose a reason for hiding this comment

Uh oh!

siddhantsangwan left a comment

Choose a reason for hiding this comment

Uh oh!

swamirishi left a comment

Choose a reason for hiding this comment

Uh oh!

kerneltime commented Dec 19, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants