
Conversation

@hanishakoneru (Contributor) commented Mar 30, 2022

What changes were proposed in this pull request?

Currently, containers are marked UNHEALTHY by the Container Scrubber for one of the following reasons:

  • If an operation fails on an open/closing container, it is marked unhealthy so that subsequent write transactions also fail.

  • If the Container Scrubber is enabled and ContainerMetadataScanner detects an error during KeyValueContainerCheck#fastCheck() (a sketch of these checks follows this list):

    • Metadata path or Chunks path is not accessible as a directory
    • Container checksum verification fails
    • On-disk Container Yaml data does not match the in-memory container data (ContainerType, ContainerID, Container DBType, Metadata Path)
  • If the Container Scrubber is enabled and ContainerDataScanner (runs only on closed and quasi-closed containers) detects any block with a missing or corrupted chunks file.
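
For reference, the metadata scan amounts to checks along these lines. This is a simplified sketch, not the actual KeyValueContainerCheck code; ContainerYamlData, ContainerData, and verifyContainerChecksum are illustrative stand-ins.

// Simplified illustration of the fastCheck() conditions listed above.
// NOT the real Ozone implementation; the types and helpers are stand-ins.
boolean fastCheck(java.io.File metadataPath, java.io.File chunksPath,
    ContainerYamlData onDisk, ContainerData inMemory) {
  if (!metadataPath.isDirectory() || !chunksPath.isDirectory()) {
    return false; // metadata/chunks path not accessible as a directory
  }
  if (!verifyContainerChecksum(onDisk)) {
    return false; // container checksum verification failed
  }
  // On-disk YAML data must agree with the in-memory container data.
  return onDisk.getContainerType().equals(inMemory.getContainerType())
      && onDisk.getContainerID() == inMemory.getContainerID()
      && onDisk.getDbType().equals(inMemory.getDbType())
      && onDisk.getMetadataPath().equals(inMemory.getMetadataPath());
}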

If a container in “open” state in SCM is marked unhealthy (in the container report), SCM asks the DNs to close the container. But for a “closing” container with an “unhealthy” replica, SCM leaves the container replica as is.

Some of the issues with how unhealthy containers are handled:

  1. If ReplicationManager does not find a healthy replica for a container, it does not replicate that container. So if there is only one replica of a container and it is unhealthy, SCM will never replicate it, and there is potential for data loss if that single replica is lost for any reason (for example, a disk failure).
  2. If there is a Quasi-Closed replica and an Unhealthy replica, SCM will delete the unhealthy replica. SCM does not check whether the unhealthy replica has a higher bcsId.
  3. Let’s say there are 3 quasi-closed replicas of a closed container, all of them having bcsId < container bcsId (the closed replica is lost and a quasi-closed replica is replicated). ReplicationManager will delete one of these quasi-closed replicas (handleUnstableContainer) and then, in the next cycle, replicate it again as the container would now be under-replicated (handleUnderreplicatedContainer). This becomes a loop of replicating and deleting the container replica (a trace of this loop follows the list).
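
To make issue 3 concrete, here is a hypothetical trace of the cycle; the bcsId values are made up for illustration.

import java.util.ArrayList;
import java.util.List;

// Hypothetical trace of the delete/replicate loop in issue 3.
// Container bcsId = 1000 (CLOSED); all surviving replicas are QUASI_CLOSED
// with bcsId 990, so none of them matches the container's state.
List<Long> replicaBcsIds = new ArrayList<>(List.of(990L, 990L, 990L));

// Cycle N: handleUnstableContainer sees three replicas that disagree with
// the container state and deletes one of them.
replicaBcsIds.remove(replicaBcsIds.size() - 1);   // -> [990, 990]

// Cycle N+1: handleUnderreplicatedContainer sees 2 of 3 replicas and
// re-replicates one of the remaining quasi-closed copies.
replicaBcsIds.add(990L);                          // -> [990, 990, 990]

// ...and the next cycle deletes one again: an endless loop.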

SCM should be more conservative about deleting unhealthy containers, as they could possibly be recovered. This Jira proposes to:

  1. Let SCM replicate an unhealthy container if there are no other healthy replicas.
  2. If all the replicas are unstable (either unhealthy or quasi-closed with a lower bcsId than the container's), then no replica should be deleted.
  3. An unhealthy replica should be deleted only if its bcsId is lower than the bcsIds of all quasi-closed and closed replicas (a sketch of this rule follows the list).
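
A minimal sketch of rule 3, assuming a helper of this shape (the name and signature are hypothetical, not code from this PR):

import java.util.List;

// Proposed rule 3: an UNHEALTHY replica may be deleted only if its bcsId is
// strictly lower than the bcsId of every QUASI_CLOSED and CLOSED replica.
// Rule 2 is the degenerate case: with no stable replicas, delete nothing.
static boolean canDeleteUnhealthyReplica(long unhealthyBcsId,
    List<Long> stableReplicaBcsIds) {
  if (stableReplicaBcsIds.isEmpty()) {
    return false; // all replicas unstable: keep everything (rule 2)
  }
  return stableReplicaBcsIds.stream()
      .allMatch(stableBcsId -> unhealthyBcsId < stableBcsId);
}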

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6447

How was this patch tested?

Added tests in TestReplicationManager. An illustrative sketch of one such case follows.
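
For illustration, a test for proposal 1 might look roughly like this, reusing the createContainer/addReplica helpers that appear later in this review; assertReplicateCommandSent is a hypothetical assertion helper, not an actual method in TestReplicationManager.

// Sketch only: a container whose sole remaining replica is UNHEALTHY
// should still be scheduled for replication (proposal 1 above).
final ContainerInfo container = createContainer(LifeCycleState.CLOSED);
addReplica(container, NodeStatus.inServiceHealthy(), UNHEALTHY, 1000L);
replicationManager.processAll();
// Hypothetical helper: verify a replicate command was sent for the container.
assertReplicateCommandSent(container);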

final List<ContainerReplica> unhealthyReplicas = eligibleReplicas
.stream()
.filter(r -> !compareState(container.getState(), r.getState()))
.filter(r -> r.getSequenceId() > container.getSequenceId())
@guihecheng (Contributor) commented on the hunk above:
Maybe a typo? Should this be r.getSequenceId() < container.getSequenceId(), as the comment states?

@hanishakoneru (Contributor, Author) replied:
Yes. Thanks @guihecheng for catching this.

@errose28 (Contributor) left a comment:
Thanks for working on this @hanishakoneru. I think it would be very helpful to add tests to TestReplicationManager for these cases. We could add test cases for each of these scenarios involving unhealthy/quasi-closed replicas with different BCSIDs to enforce and document expected replication manager behavior in these situations.


// If there is only 1 replica of a container remaining, replicate it
// even if it is unhealthy.
if (source.size() == 0 && replicas.size() == 1) {
@errose28 (Contributor) commented on the hunk above:
If the goal is to preserve unhealthy containers, shouldn't we replicate them whenever they are under-replicated, not just when a single replica remains?

@hanishakoneru (Contributor, Author) replied:
Updated the PR to replicate even unhealthy containers when closed or quasi-closed replicas are not available.
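
Roughly, the updated source selection behaves like this sketch (illustrative only, not the actual diff; State and the surrounding variables are stand-ins):

// Prefer CLOSED/QUASI_CLOSED replicas as replication sources, but fall
// back to UNHEALTHY replicas when no stable source exists, rather than
// risk losing the container's only copies.
List<ContainerReplica> sources = replicas.stream()
    .filter(r -> r.getState() == State.CLOSED
        || r.getState() == State.QUASI_CLOSED)
    .collect(Collectors.toList());
if (sources.isEmpty()) {
  sources = replicas.stream()
      .filter(r -> r.getState() == State.UNHEALTHY)
      .collect(Collectors.toList());
}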

@hanishakoneru (Contributor, Author) commented:
Thanks @guihecheng and @errose28 for the reviews. I addressed the review comments and added some unit tests.

@avijayanhwx requested a review from sodonnel April 13, 2022 18:21
@errose28 (Contributor) left a comment:

Thanks for adding tests @hanishakoneru. Can we also add a test for this case described in the PR description?

Let’s say there are 3 quasi-closed replicas of a closed container, all of them having bcsId < container bcsId (the closed replica is lost and a quasi-closed replica is replicated). ReplicationManager will delete one of these quasi-closed replicas (handleUnstableContainer) and then, in the next cycle, replicate it again as the container would now be under-replicated (handleUnderreplicatedContainer). This becomes a loop of replicating and deleting the container replica.

final ContainerInfo container = createContainer(LifeCycleState.CLOSED);
addReplica(container, NodeStatus.inServiceHealthy(), QUASI_CLOSED, 990L);
addReplica(container, NodeStatus.inServiceHealthy(), QUASI_CLOSED, 990L);
addReplica(container, NodeStatus.inServiceHealthy(), UNHEALTHY, 980L);
@errose28 (Contributor) commented on the hunk above:
Why should SCM not delete the unhealthy container in this case? It seems an unhealthy container with a lower BCSID has no advantage over a quasi-closed container with a higher BCSID. The PR description currently says:

An unhealthy replica should be deleted only if its bcsId is lower than the bcsIds of all quasi-closed and closed replicas.

@hanishakoneru (Contributor, Author) replied:

You are right. But with unhealthy replicas, it is hard to determine whether the replica can be recovered and, if recovered, whether the bcsId would also change. That's why we thought of keeping the replica around if the closed container is lost.

But by the same argument, do we never delete unhealthy containers? I guess there is no absolutely right answer for this. The only thing we can be certain about is that when we have a closed replica, it can be assumed to be the source of truth.

The PR description currently says:

An unhealthy replica should be deleted only if its bcsId is lower than the bcsIds of all quasi-closed and closed replicas.

The 2nd proposed fix in the PR description is what handles this case currently:

If all the replicas are unstable (either unhealthy or quasi-closed with a lower bcsId than the container's), then no replica should be deleted.

cc. @nandakumar131

@errose28 (Contributor) replied:

If we think the unhealthy container's BCSID may be inaccurate, then I am okay with keeping it in this scenario.

@hanishakoneru (Contributor, Author) commented:
Thanks for the review, Ethan.

Can we also add a test for this case described in the PR description?

Let’s say there are 3 quasi-closed replicas of a closed container, all of them having bcsId < container bcsId (the closed replica is lost and a quasi-closed replica is replicated).

testAllUnstableReplicas is meant to cover exactly this case, where all replicas are unstable. I can update this test based on what we decide to do with the unhealthy replica when the closed replica is lost.

@errose28 (Contributor) commented:
The replication manager is currently being refactored to combine the LegacyReplicationManager and the main ReplicationManager as part of the EC work. I'm closing this PR for now until the two classes are reconciled; after that, proceeding with these changes should be easier.

See the Jiras titled “EC: ReplicationManager” under HDDS-6462 for a list of blockers.
