Skip to content

Conversation

@sodonnel
Copy link
Contributor

What changes were proposed in this pull request?

Mis Replication is another form of under replication, but with a lower priority. Therefore it would make sense that mis-replicated containers are queued in the same queue as under-replicated. Ideally, we would not spend resources fixing mis-replicated until all the under replicated containers are handled, so having them in separate queues doesn't add any value.

Mis-Replicated containers are queued with a priority lower than decommissioning containers in this PR, so they do not block decommission from completing if they are ahead of it in the queue.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7654

How was this patch tested?

Added a unit test to prove the mis-rep container queues with a lower priority than decommission. Small changes to existing tests cover most of the rest of the changes.

Ideally we should have a test in TestReplicationManager to cover the logic in ReplicationManager.processUnderReplicatedContainer but the way things are currently structured, that would need a bit of refactoring to be feasible, as ideally we should inject the under / over / mis-rep handler and then we can mock them to ensure the correct one is called. At the moment, they are created inside ReplicationManager so that is not possible.

* under / over replicated and decommissioning replicas in the under
* replication queue.
*/
private static final int MIS_REP_REDUNDANCY = 6;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of constant value, can we have some priority based on the misreplication count

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really convinced that mis-rep count really means much and it is worth ordering them further, as the container already has enough replicas. We want these to start with a lower priority than anything else (under rep, decommission), which setting it to 6 does. After that, we can only lower the priority further. If we have a mis rep count of 1 or 2 for a 3-2 and a 6-3 container, which is higher priority? I'm not sure about a good way to do this.

if (result.getContainerInfo().getReplicationType() == EC) {
return ecUnderReplicationHandler.processAndCreateCommands(replicas,
pendingOps, result, maintenanceRedundancy);
if (result.getHealthState()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we create a separate method to processMisreplicatedContainer & have this if case in UnderReplicatedProcessor class?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: probably making it into a switch case would be a good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think we should have a single processUnhealthyContainer method, and in it, have a switch / IF that caters for EC / Ratis and then Over / Under / Mis replicated. But we should also add some tests for that, to ensure the correct handler is called in the correct circumstances. However the way RM is currently we would need to change it to inject the handlers, so I think we should clean this up in another refactor jira.

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. @swamirishi we can merge this unless you have any other comments.

Copy link
Contributor

@swamirishi swamirishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kerneltime kerneltime merged commit ef48076 into apache:master Dec 19, 2022
@kerneltime
Copy link
Contributor

Thank you @sodonnel !

errose28 added a commit to errose28/ozone that referenced this pull request Dec 21, 2022
* master: (88 commits)
  HDDS-7463. SCM Pipeline scrubber never able to cleanup allocated pipeline. (apache#4093)
  HDDS-7683. EC: ReplicationManager - UnderRep maintenance handler should not request nodes if none needed (apache#4109)
  HDDS-7635. Update failure metrics when allocate block fails in preExecute. (apache#4086)
  HDDS-7565. FSO purge directory for old bucket can update quota for new bucket (apache#4021)
  HDDS-7654. EC: ReplicationManager - merge mis-rep queue into under replicated queue (apache#4099)
  HDDS-7621. Update SCM term in datanode from heartbeat without any commands (apache#4101)
  HDDS-7649. S3 multipart upload EC release space quota wrong for old version (apache#4095)
  HDDS-7399. Enable specifying external root ca (apache#4053)
  HDDS-7398. Tool to remove old certs from the scm db (apache#3972)
  HDDS-6650. S3MultipartUpload support update bucket usedNamespace. (apache#4081)
  HDDS-7605. Improve logging in Container Balancer (apache#4067)
  HDDS-7616. EC: Refactor Unhealthy Replicated Processor (apache#4063)
  HDDS-7426. Add a new acceptance test for Streaming Pipeline. (apache#4019)
  HDDS-7478. [Ozone-Streaming] NPE in when creating a file with o3fs. (apache#3949)
  HDDS-7425. Add documentation for the new Streaming Pipeline feature. (apache#3913)
  HDDS-7438. [Ozone-Streaming] Add a createStreamKey method to OzoneBucket. (apache#3914)
  HDDS-7431. [Ozone-Streaming] Disable data steam by default. (apache#3900)
  HDDS-6955. [Ozone-streaming] Add explicit stream flag in ozone shell (apache#3559)
  HDDS-6867.  [Ozone-Streaming] PutKeyHandler should not use streaming to put EC key. (apache#3516)
  HDDS-6842. [Ozone-Streaming] Reduce the number of watch requests in StreamCommitWatcher. (apache#3492)
  ...
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Feb 6, 2023
…plicated queue (apache#4099)

Co-authored-by: S O'Donnell <[email protected]>
(cherry picked from commit ef48076)
Change-Id: I2f3fea3e634e6e34ce63b1bd05fc06c0e75ff094
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants