
Conversation

@hanishakoneru (Contributor) commented Mar 30, 2022

What changes were proposed in this pull request?

Currently, containers are marked UNHEALTHY by the Container Scrubber for one of the following reasons:

  • If an operation fails on an open/closing container, it is marked unhealthy so that subsequent write transactions also fail.

  • If the Container Scrubber is enabled and ContainerMetadataScanner detects an error during KeyValueContainerCheck#fastCheck() (a sketch of these checks follows this list):

    • Metadata path or Chunks path is not accessible as a directory
    • Container checksum verification fails
    • On-disk Container Yaml data does not match the in-memory container data (ContainerType, ContainerID, Container DBType, Metadata Path)
  • If the Container Scrubber is enabled and ContainerDataScanner (runs only on closed and quasi-closed containers) detects any block with a missing or corrupted chunks file.
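
For reference, the metadata scan amounts to checks along these lines. This is a simplified sketch, not the actual KeyValueContainerCheck code; ContainerYamlData, ContainerData, and verifyContainerChecksum are illustrative stand-ins.

// Simplified illustration of the fastCheck() conditions listed above.
// NOT the real Ozone implementation; the types and helpers are stand-ins.
boolean fastCheck(java.io.File metadataPath, java.io.File chunksPath,
    ContainerYamlData onDisk, ContainerData inMemory) {
  if (!metadataPath.isDirectory() || !chunksPath.isDirectory()) {
    return false; // metadata/chunks path not accessible as a directory
  }
  if (!verifyContainerChecksum(onDisk)) {
    return false; // container checksum verification failed
  }
  // On-disk YAML data must agree with the in-memory container data.
  return onDisk.getContainerType().equals(inMemory.getContainerType())
      && onDisk.getContainerID() == inMemory.getContainerID()
      && onDisk.getDbType().equals(inMemory.getDbType())
      && onDisk.getMetadataPath().equals(inMemory.getMetadataPath());
}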

If a container in “open” state in SCM is marked unhealthy (in the container report), SCM asks the DNs to close the container. But for a “closing” container with an “unhealthy” replica, SCM leaves the container replica as is.

Some of the issues with how unhealthy containers are handled:

  1. If ReplicationManager does not find a healthy replica for a container, it does not replicate that container. So if there is only one replica of a container and it is unhealthy, SCM will never replicate it, and there is potential for data loss if that single replica is lost for any reason (for example, a disk failure).
  2. If there is a Quasi-Closed replica and an Unhealthy replica, SCM will delete the unhealthy replica. SCM does not check whether the unhealthy replica has a higher bcsId.
  3. Let’s say there are 3 quasi-closed replicas of a closed container, all of them having bcsId < container bcsId (the closed replica is lost and a quasi-closed replica is replicated). ReplicationManager will delete one of these quasi-closed replicas (handleUnstableContainer) and then, in the next cycle, replicate it again as the container would now be under-replicated (handleUnderreplicatedContainer). This becomes a loop of replicating and deleting the container replica (a trace of this loop follows the list).
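
To make issue 3 concrete, here is a hypothetical trace of the cycle; the bcsId values are made up for illustration.

import java.util.ArrayList;
import java.util.List;

// Hypothetical trace of the delete/replicate loop in issue 3.
// Container bcsId = 1000 (CLOSED); all surviving replicas are QUASI_CLOSED
// with bcsId 990, so none of them matches the container's state.
List<Long> replicaBcsIds = new ArrayList<>(List.of(990L, 990L, 990L));

// Cycle N: handleUnstableContainer sees three replicas that disagree with
// the container state and deletes one of them.
replicaBcsIds.remove(replicaBcsIds.size() - 1);   // -> [990, 990]

// Cycle N+1: handleUnderreplicatedContainer sees 2 of 3 replicas and
// re-replicates one of the remaining quasi-closed copies.
replicaBcsIds.add(990L);                          // -> [990, 990, 990]

// ...and the next cycle deletes one again: an endless loop.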

SCM should be more conservative about deleting unhealthy containers, as they could possibly be recovered. This Jira proposes to:

  1. Let SCM replicate an unhealthy container if there are no other healthy replicas.
  2. If all the replicas are unstable (either unhealthy or quasi-closed with a lower bcsId than the container's), then no replica should be deleted.
  3. An unhealthy replica should be deleted only if its bcsId is lower than the bcsIds of all quasi-closed and closed replicas (a sketch of this rule follows the list).
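
A minimal sketch of rule 3, assuming a helper of this shape (the name and signature are hypothetical, not code from this PR):

import java.util.List;

// Proposed rule 3: an UNHEALTHY replica may be deleted only if its bcsId is
// strictly lower than the bcsId of every QUASI_CLOSED and CLOSED replica.
// Rule 2 is the degenerate case: with no stable replicas, delete nothing.
static boolean canDeleteUnhealthyReplica(long unhealthyBcsId,
    List<Long> stableReplicaBcsIds) {
  if (stableReplicaBcsIds.isEmpty()) {
    return false; // all replicas unstable: keep everything (rule 2)
  }
  return stableReplicaBcsIds.stream()
      .allMatch(stableBcsId -> unhealthyBcsId < stableBcsId);
}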

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6447

How was this patch tested?

Added tests in TestReplicationManager. An illustrative sketch of one such case follows.
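
For illustration, a test for proposal 1 might look roughly like this, reusing the createContainer/addReplica helpers that appear later in this review; assertReplicateCommandSent is a hypothetical assertion helper, not an actual method in TestReplicationManager.

// Sketch only: a container whose sole remaining replica is UNHEALTHY
// should still be scheduled for replication (proposal 1 above).
final ContainerInfo container = createContainer(LifeCycleState.CLOSED);
addReplica(container, NodeStatus.inServiceHealthy(), UNHEALTHY, 1000L);
replicationManager.processAll();
// Hypothetical helper: verify a replicate command was sent for the container.
assertReplicateCommandSent(container);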

final List<ContainerReplica> unhealthyReplicas = eligibleReplicas
.stream()
.filter(r -> !compareState(container.getState(), r.getState()))
.filter(r -> r.getSequenceId() > container.getSequenceId())
@guihecheng (Contributor) commented on the hunk above:
Maybe a typo? Should this be r.getSequenceId() < container.getSequenceId(), as the comment states?

@hanishakoneru (Contributor, Author) replied:
Yes. Thanks @guihecheng for catching this.

@errose28 (Contributor) left a comment:
Thanks for working on this @hanishakoneru. I think it would be very helpful to add tests to TestReplicationManager for these cases. We could add test cases for each of these scenarios involving unhealthy/quasi-closed replicas with different BCSIDs to enforce and document expected replication manager behavior in these situations.


// If there is only 1 replica of a container remaining, replicate it
// even if it is unhealthy.
if (source.size() == 0 && replicas.size() == 1) {
@errose28 (Contributor) commented on the hunk above:
If the goal is to preserve unhealthy containers, shouldn't we replicate them whenever they are under-replicated, not just when a single replica remains?

@hanishakoneru (Contributor, Author) replied:
Updated the PR to replicate even unhealthy containers when closed or quasi-closed replicas are not available.
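
Roughly, the updated source selection behaves like this sketch (illustrative only, not the actual diff; State and the surrounding variables are stand-ins):

// Prefer CLOSED/QUASI_CLOSED replicas as replication sources, but fall
// back to UNHEALTHY replicas when no stable source exists, rather than
// risk losing the container's only copies.
List<ContainerReplica> sources = replicas.stream()
    .filter(r -> r.getState() == State.CLOSED
        || r.getState() == State.QUASI_CLOSED)
    .collect(Collectors.toList());
if (sources.isEmpty()) {
  sources = replicas.stream()
      .filter(r -> r.getState() == State.UNHEALTHY)
      .collect(Collectors.toList());
}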

@hanishakoneru (Contributor, Author) commented:
Thanks @guihecheng and @errose28 for the reviews. I addressed the review comments and added some unit tests.

@avijayanhwx requested a review from sodonnel April 13, 2022 18:21
@errose28 (Contributor) left a comment:

Thanks for adding tests @hanishakoneru. Can we also add a test for this case described in the PR description?

Let’s say there are 3 quasi-closed replicas of a closed container, all of them having bcsId < container bcsId (the closed replica is lost and a quasi-closed replica is replicated). ReplicationManager will delete one of these quasi-closed replicas (handleUnstableContainer) and then, in the next cycle, replicate it again as the container would now be under-replicated (handleUnderreplicatedContainer). This becomes a loop of replicating and deleting the container replica.

final ContainerInfo container = createContainer(LifeCycleState.CLOSED);
addReplica(container, NodeStatus.inServiceHealthy(), QUASI_CLOSED, 990L);
addReplica(container, NodeStatus.inServiceHealthy(), QUASI_CLOSED, 990L);
addReplica(container, NodeStatus.inServiceHealthy(), UNHEALTHY, 980L);
@errose28 (Contributor) commented on the hunk above:
Why should SCM not delete the unhealthy container in this case? It seems an unhealthy container with a lower BCSID has no advantage over a quasi-closed container with a higher BCSID. The PR description currently says:

An unhealthy replica should be deleted only if its bcsId is lower than the bcsIds of all quasi-closed and closed replicas.

@hanishakoneru (Contributor, Author) replied:

You are right. But with unhealthy replicas, it is hard to determine whether the replica can be recovered and, if recovered, whether the bcsId would also change. That's why we thought of keeping the replica around if the closed container is lost.

But by the same argument, do we never delete unhealthy containers? I guess there is no absolutely right answer for this. The only thing we can be certain about is that when we have a closed replica, it can be assumed to be the source of truth.

The PR description currently says:

An unhealthy replica should be deleted only if its bcsId is lower than the bcsIds of all quasi-closed and closed replicas.

The 2nd proposed fix in the PR description is what handles this case currently:

If all the replicas are unstable (either unhealthy or quasi-closed with a lower bcsId than the container's), then no replica should be deleted.

cc. @nandakumar131

@errose28 (Contributor) replied:

If we think the unhealthy container's BCSID may be inaccurate, then I am okay with keeping it in this scenario.

@hanishakoneru (Contributor, Author) commented:
Thanks for the review, Ethan.

Can we also add a test for this case described in the PR description?

Let’s say there are 3 quasi-closed replicas of a closed container, all of them having bcsId < container bcsId (the closed replica is lost and a quasi-closed replica is replicated).

testAllUnstableReplicas is meant to cover exactly this case, where all replicas are unstable. I can update this test based on what we decide to do with the unhealthy replica when the closed replica is lost.

@errose28 (Contributor) commented:
The replication manager is currently being refactored to combine the LegacyReplicationManager and the main ReplicationManager as part of the EC work. I'm closing this PR for now until the two classes are reconciled; after that, proceeding with these changes should be easier.

See the Jiras titled “EC: ReplicationManager” under HDDS-6462 for a list of blockers.
