Skip to content

Conversation

@adoroszlai
Copy link
Contributor

What changes were proposed in this pull request?

Extract a common interface for ContainerReplicaCount and ECContainerReplicaCount to allow DatanodeAdminMonitorImpl to handle both EC and non-EC containers.

https://issues.apache.org/jira/browse/HDDS-6970

How was this patch tested?

Full CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/2590660510

@adoroszlai adoroszlai self-assigned this Jun 30, 2022
@adoroszlai adoroszlai added the EC label Jun 30, 2022
private final ContainerInfo container;
private final Set<ContainerReplica> replica;

public ContainerIdenticalReplicaCount(ContainerInfo container,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the name of the class. As the other one is called ECContainerReplicaCount, would this be better as RatisContainerReplicaCount, or ReplicatedContainerReplicaCount maybe?

*/
public boolean isMissing() {
return replica.size() == 0;
default boolean isMissing() {
Copy link
Contributor

@sodonnel sodonnel Jul 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For EC, we have a method public boolean unRecoverable(). The definition of missing for EC is a bit strange. For Ratis it is very clear - there are no replicas available at all. For EC, a container is effectively missing if there are no dataNum containers available. We probably need to override this in the EC class and have it return the the result of unRecoverable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks. What do you think about merging these two methods, e.g. renaming isMissing to unRecoverable (which seems more general) or vice versa?

Note: I think isUnrecoverable would be a more standard name for unRecoverable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea we could rename isMissing to unRecoverable. I think I originally has isMissing in the EC class, but Uma asked me to change it to unRecoverable. Would make sense to standardize on unRecoverable both classes I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go ahead with isUnrecoverable - I think it is best.

public boolean isOverReplicated() {
return missingReplicas() + inFlightDel < 0;
}
int additionalReplicaNeeded();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should omit this from the interface. For EC, its a tricky calculation and probably not that useful. It could say 2 additional replicas needed, but its not too helpful, as we don't know what indexes, or if its a reconstruction or an easy copy from a decommissioning node. It feels like this doesn't apply well to the common methods, and I don't think its used inside the decommission code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, turns out it's not needed.

containerInfo.containerID());
List<ContainerReplicaOp> pendingOps =
containerReplicaPendingOps.getPendingOps(containerInfo.containerID());
return new ECContainerReplicaCount(containerInfo, replicas, pendingOps, 0);
Copy link
Contributor

@sodonnel sodonnel Jul 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The zero at the end of the parameters is definitely not correct, but I don't know yet what is the correct value to put here.

Could you add a TODO - define maintenance redundancy for EC (HDDS-6975) here?

We will need to fix this in a couple of places I think (not related to this PR).

containerInfo.containerID());
List<ContainerReplicaOp> pendingOps =
containerReplicaPendingOps.getPendingOps(containerInfo.containerID());
// TODO: define maintenance redundancy for EC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth adding the Jira number here too - (HDDS-6975 ?

Copy link
Contributor

@sodonnel sodonnel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes LGTM. I think we were also going to change isMissing() to isUnrecoverable?

@adoroszlai adoroszlai merged commit a6500f6 into apache:master Jul 1, 2022
@adoroszlai adoroszlai deleted the HDDS-6970 branch July 1, 2022 17:38
@adoroszlai
Copy link
Contributor Author

Thanks @sodonnel for the review.

errose28 added a commit to errose28/ozone that referenced this pull request Jul 12, 2022
* master: (46 commits)
  HDDS-6901. Configure HDDS volume reserved as percentage of the volume space. (apache#3532)
  HDDS-6978. EC: Cleanup RECOVERING container on DN restarts (apache#3585)
  HDDS-6982. EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. (apache#3583)
  HDDS-6968. Addendum: [Multi-Tenant] Fix USER_MISMATCH error even on correct user. (apache#3578)
  HDDS-6794. EC: Analyze and add putBlock even on non writing node in the case of partial single stripe. (apache#3514)
  HDDS-6900. Propagate TimeoutException for all SCM HA Ratis calls. (apache#3564)
  HDDS-6938. handle NPE when removing prefixAcl (apache#3568)
  HDDS-6960. EC: Implement the Over-replication Handler (apache#3572)
  HDDS-6979. Remove unused plexus dependency declaration (apache#3579)
  HDDS-6957. EC: ReplicationManager - priortise under replicated containers (apache#3574)
  HDDS-6723. Close Rocks objects properly in OzoneManager (apache#3400)
  HDDS-6942. Ozone Buckets/Objects created via S3 should not allow group access (apache#3553)
  HDDS-6965. Increase timeout for basic check (apache#3563)
  HDDS-6969. Add link to compose directory in smoketest README (apache#3567)
  HDDS-6970. EC: Ensure DatanodeAdminMonitor can handle EC containers during decommission (apache#3573)
  HDDS-6977. EC: Remove references to ContainerReplicaPendingOps in TestECContainerReplicaCount (apache#3575)
  HDDS-6217. Cleanup XceiverClientGrpc TODOs, and document how the client works and should be used. (apache#3012)
  HDDS-6773. Cleanup TestRDBTableStore (apache#3434) - fix checkstyle
  HDDS-6773. Cleanup TestRDBTableStore (apache#3434)
  HDDS-6676. KeyValueContainerData#getProtoBufMessage() should set block count (apache#3371)
  ...

Conflicts:
    hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/SCMUpgradeFinalizer.java
duongkame pushed a commit to duongkame/ozone that referenced this pull request Aug 16, 2022
HDDS-6970. EC: Ensure DatanodeAdminMonitor can handle EC containers during decommission (apache#3573)

(cherry picked from commit a6500f6)
Change-Id: I2ef38b143e541c47d1988e3a1a42248620699c53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants