HDDS-6970. EC: Ensure DatanodeAdminMonitor can handle EC containers during decommission #3573

adoroszlai · 2022-06-30T18:07:00Z

What changes were proposed in this pull request?

Extract a common interface for ContainerReplicaCount and ECContainerReplicaCount to allow DatanodeAdminMonitorImpl to handle both EC and non-EC containers.
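As a rough illustration of the idea (not the code in this PR), a shared interface lets the decommission check ask the same question of either kind of container. The names `ReplicaCount`, `SimpleRatisCount`, `SimpleEcCount` and `canFinishDecommission` are hypothetical, and the replication logic is heavily simplified.

```java
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for the extracted common interface.
interface ReplicaCount {
  boolean isSufficientlyReplicated();
}

// Replicated (Ratis-style) container: enough full copies must exist.
final class SimpleRatisCount implements ReplicaCount {
  private final int healthyReplicas;
  private final int replicationFactor;

  SimpleRatisCount(int healthyReplicas, int replicationFactor) {
    this.healthyReplicas = healthyReplicas;
    this.replicationFactor = replicationFactor;
  }

  @Override
  public boolean isSufficientlyReplicated() {
    return healthyReplicas >= replicationFactor;
  }
}

// EC container: every index from 1 to data + parity must be present.
final class SimpleEcCount implements ReplicaCount {
  private final Set<Integer> healthyIndexes;
  private final int requiredIndexes;

  SimpleEcCount(Set<Integer> healthyIndexes, int dataNum, int parityNum) {
    this.healthyIndexes = healthyIndexes;
    this.requiredIndexes = dataNum + parityNum;
  }

  @Override
  public boolean isSufficientlyReplicated() {
    for (int i = 1; i <= requiredIndexes; i++) {
      if (!healthyIndexes.contains(i)) {
        return false;
      }
    }
    return true;
  }
}

final class DecommissionCheck {
  // The caller no longer needs to know which replication type it is handling.
  static boolean canFinishDecommission(List<ReplicaCount> counts) {
    return counts.stream().allMatch(ReplicaCount::isSufficientlyReplicated);
  }

  public static void main(String[] args) {
    List<ReplicaCount> counts = List.of(
        new SimpleRatisCount(3, 3),
        new SimpleEcCount(Set.of(1, 2, 3, 4), 3, 2)); // index 5 is absent
    System.out.println(canFinishDecommission(counts));
  }
}
```

In this sketch the monitor depends only on the interface, so adding another replication type would not require changes to the decommission check itself.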

https://issues.apache.org/jira/browse/HDDS-6970

How was this patch tested?

Full CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/2590660510


sodonnel · 2022-07-01T08:54:58Z

...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerIdenticalReplicaCount.java

+  private final ContainerInfo container;
+  private final Set<ContainerReplica> replica;
+
+  public ContainerIdenticalReplicaCount(ContainerInfo container,


I'm not sure about the name of the class. As the other one is called ECContainerReplicaCount, would this be better as RatisContainerReplicaCount, or ReplicatedContainerReplicaCount maybe?

sodonnel · 2022-07-01T09:01:39Z

...dds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerReplicaCount.java

   */
-  public boolean isMissing() {
-    return replica.size() == 0;
+  default boolean isMissing() {


For EC, we have a method public boolean unRecoverable(). The definition of missing for EC is a bit strange. For Ratis it is very clear: there are no replicas available at all. For EC, a container is effectively missing if fewer than dataNum replicas are available. We probably need to override this in the EC class and have it return the result of unRecoverable.

Done, thanks. What do you think about merging these two methods, e.g. renaming isMissing to unRecoverable (which seems more general) or vice versa?

Note: I think isUnrecoverable would be a more standard name for unRecoverable.

Yeah, we could rename isMissing to unRecoverable. I think I originally had isMissing in the EC class, but Uma asked me to change it to unRecoverable. It would make sense to standardize on unRecoverable in both classes, I think.

Go ahead with isUnrecoverable - I think it is best.
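A minimal sketch of the two meanings discussed in this thread, assuming simplified types: for a replicated container, unrecoverable means no replicas at all, while for EC it means fewer than dataNum distinct replica indexes remain. The names `UnrecoverableCheck`, `RatisCheck` and `EcCheck` are hypothetical and are not the classes changed in this PR.

```java
import java.util.Set;

// Hypothetical unified method name, as agreed above.
interface UnrecoverableCheck {
  boolean isUnrecoverable();
}

// Replicated container: unrecoverable only when no replica exists.
final class RatisCheck implements UnrecoverableCheck {
  private final int replicaCount;

  RatisCheck(int replicaCount) {
    this.replicaCount = replicaCount;
  }

  @Override
  public boolean isUnrecoverable() {
    return replicaCount == 0;
  }
}

// EC container: any dataNum distinct indexes are enough to reconstruct,
// so it is unrecoverable once fewer than dataNum indexes are available.
final class EcCheck implements UnrecoverableCheck {
  private final Set<Integer> availableIndexes;
  private final int dataNum;

  EcCheck(Set<Integer> availableIndexes, int dataNum) {
    this.availableIndexes = availableIndexes;
    this.dataNum = dataNum;
  }

  @Override
  public boolean isUnrecoverable() {
    return availableIndexes.size() < dataNum;
  }
}

final class UnrecoverableDemo {
  public static void main(String[] args) {
    // EC with dataNum = 3: two indexes left cannot be reconstructed.
    System.out.println(new EcCheck(Set.of(1, 2), 3).isUnrecoverable());
    // Three indexes (data or parity) are enough.
    System.out.println(new EcCheck(Set.of(1, 2, 4), 3).isUnrecoverable());
    // A replicated container with one replica can still be copied.
    System.out.println(new RatisCheck(1).isUnrecoverable());
  }
}
```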

sodonnel · 2022-07-01T09:08:39Z

...dds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerReplicaCount.java

-  public boolean isOverReplicated() {
-    return missingReplicas() + inFlightDel < 0;
-  }
+  int additionalReplicaNeeded();


I wonder if we should omit this from the interface. For EC, it's a tricky calculation and probably not that useful. It could say 2 additional replicas are needed, but that's not too helpful, as we don't know which indexes, or whether it's a reconstruction or an easy copy from a decommissioning node. It feels like this doesn't apply well to the common methods, and I don't think it's used inside the decommission code.

Yep, turns out it's not needed.

sodonnel · 2022-07-01T09:29:08Z

...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java

+        containerInfo.containerID());
+    List<ContainerReplicaOp> pendingOps =
+        containerReplicaPendingOps.getPendingOps(containerInfo.containerID());
+    return new ECContainerReplicaCount(containerInfo, replicas, pendingOps, 0);


The zero at the end of the parameters is definitely not correct, but I don't know yet what is the correct value to put here.

Could you add a TODO - define maintenance redundancy for EC (HDDS-6975) here?

We will need to fix this in a couple of places I think (not related to this PR).

sodonnel · 2022-07-01T10:52:18Z

...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java

+        containerInfo.containerID());
+    List<ContainerReplicaOp> pendingOps =
+        containerReplicaPendingOps.getPendingOps(containerInfo.containerID());
+    // TODO: define maintenance redundancy for EC


Is it worth adding the Jira number here too (HDDS-6975)?

sodonnel

These changes LGTM. I think we were also going to change isMissing() to isUnrecoverable?

adoroszlai · 2022-07-01T17:38:25Z

Thanks @sodonnel for the review.

* master: (46 commits)
  HDDS-6901. Configure HDDS volume reserved as percentage of the volume space. (apache#3532)
  HDDS-6978. EC: Cleanup RECOVERING container on DN restarts (apache#3585)
  HDDS-6982. EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. (apache#3583)
  HDDS-6968. Addendum: [Multi-Tenant] Fix USER_MISMATCH error even on correct user. (apache#3578)
  HDDS-6794. EC: Analyze and add putBlock even on non writing node in the case of partial single stripe. (apache#3514)
  HDDS-6900. Propagate TimeoutException for all SCM HA Ratis calls. (apache#3564)
  HDDS-6938. handle NPE when removing prefixAcl (apache#3568)
  HDDS-6960. EC: Implement the Over-replication Handler (apache#3572)
  HDDS-6979. Remove unused plexus dependency declaration (apache#3579)
  HDDS-6957. EC: ReplicationManager - priortise under replicated containers (apache#3574)
  HDDS-6723. Close Rocks objects properly in OzoneManager (apache#3400)
  HDDS-6942. Ozone Buckets/Objects created via S3 should not allow group access (apache#3553)
  HDDS-6965. Increase timeout for basic check (apache#3563)
  HDDS-6969. Add link to compose directory in smoketest README (apache#3567)
  HDDS-6970. EC: Ensure DatanodeAdminMonitor can handle EC containers during decommission (apache#3573)
  HDDS-6977. EC: Remove references to ContainerReplicaPendingOps in TestECContainerReplicaCount (apache#3575)
  HDDS-6217. Cleanup XceiverClientGrpc TODOs, and document how the client works and should be used. (apache#3012)
  HDDS-6773. Cleanup TestRDBTableStore (apache#3434) - fix checkstyle
  HDDS-6773. Cleanup TestRDBTableStore (apache#3434)
  HDDS-6676. KeyValueContainerData#getProtoBufMessage() should set block count (apache#3371)
  ...

Conflicts:
  hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/SCMUpgradeFinalizer.java

HDDS-6970. EC: Ensure DatanodeAdminMonitor can handle EC containers during decommission (apache#3573) (cherry picked from commit a6500f6) Change-Id: I2ef38b143e541c47d1988e3a1a42248620699c53

HDDS-6970. EC: Ensure DatanodeAdminMonitor can handle EC containers during decommission

052fae3

adoroszlai self-assigned this Jun 30, 2022

adoroszlai added the EC label Jun 30, 2022

sodonnel reviewed Jul 1, 2022

View reviewed changes

adoroszlai added 2 commits July 1, 2022 11:20

Address review comments

0cc46ab

Address warnings

957fada

sodonnel reviewed Jul 1, 2022

View reviewed changes

Add TODO for remainingRedundancyForMaintenance=0

eff1b3c

sodonnel reviewed Jul 1, 2022

View reviewed changes

Add Jira ID in TODO

2f52e46

sodonnel approved these changes Jul 1, 2022

View reviewed changes

adoroszlai added 2 commits July 1, 2022 18:10

Merge remote-tracking branch 'origin/master' into HDDS-6970

7f4fab3

Unify isMissing and unRecoverable as isUnrecoverable

a98c8fc

adoroszlai merged commit a6500f6 into apache:master Jul 1, 2022

adoroszlai deleted the HDDS-6970 branch July 1, 2022 17:38
