HDDS-6447. Refine SCM handling of unhealthy container replicas. #3920
Conversation
* master: (718 commits)
  HDDS-7342. Move encryption-related code from MultipartCryptoKeyInputStream to OzoneCryptoInputStream (apache#3852)
  HDDS-7413. Fix logging while marking container state unhealthy (apache#3887)
  Revert "HDDS-7253. Fix exception when '/' in key name (apache#3774)"
  HDDS-7396. Force close non-RATIS containers in ReplicationManager (apache#3877)
  HDDS-7121. Support namespace summaries (du, dist & counts) for legacy FS buckets (apache#3746)
  HDDS-7258. Cleanup the allocated but uncommitted blocks (apache#3778)
  HDDS-7381. Cleanup of VolumeManagerImpl (apache#3873)
  HDDS-7253. Fix exception when '/' in key name (apache#3774)
  HDDS-7182. Add property to control RocksDB max open files (apache#3843)
  HDDS-7284. JVM crash for rocksdb for read/write after close (apache#3801)
  HDDS-7368. [Multi-Tenant] Add Volume Existence check in preExecute for OMTenantCreateRequest (apache#3869)
  HDDS-7403. README Security Improvement (apache#3879)
  HDDS-7199. Implement new mix workload Read/Write Freon command (apache#3872)
  HDDS-7248. Recon: Expand the container status page to show all unhealthy container states (apache#3837)
  HDDS-7141. Recon: Improve Disk Usage Page (apache#3789)
  HDDS-7369. Fix wrong order of command arguments in Nonrolling-Upgrade.md (apache#3866)
  HDDS-6210. EC: Add EC metrics (apache#3851)
  HDDS-7355. non-primordial scm fail to get signed cert from primordial SCM when converting an unsecure cluster to secure (apache#3859)
  HDDS-7356. Update SCM-HA.zh.md to match the English version (apache#3861)
  HDDS-6930. SCM,OM,RECON should not print ERROR and exit with code 1 on successful shutdown (apache#3848)
  ...

Conflicts:
  hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java
  hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestLegacyReplicationManager.java
- Add new tests
- Group test by category
Hi @adoroszlai @sodonnel @nandakumar131, if you have time I would appreciate some feedback on the RM changes proposed in the description here before review of the code starts.
    .limit(requiredNodes)
    .collect(Collectors.toList());
deleteCandidates.removeAll(unhealthySorted);
} else {
This can only be QUASI_CLOSED containers, which cannot be force closed (not enough unique origin IDs), right?
This part of the code has changed since this review. Sorry I did not address it at the time, but I encourage you to check out the new version and post any remaining questions.
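For context, here is a rough sketch of the kind of origin-ID check the comment above alludes to. The class name, method name, and the exact quorum rule are assumptions for illustration only, not the actual LegacyReplicationManager code:

    import java.util.Set;
    import org.apache.hadoop.hdds.scm.container.ContainerReplica;

    // Hypothetical helper: a QUASI_CLOSED container can only be force closed
    // once replicas from enough unique origin datanodes have been observed.
    final class ForceCloseCheckSketch {
      static boolean canForceClose(Set<ContainerReplica> replicas,
          int replicationFactor) {
        long uniqueOrigins = replicas.stream()
            .map(ContainerReplica::getOriginDatanodeId)
            .distinct()
            .count();
        // Assumed majority quorum, e.g. 2 of 3 for RATIS/THREE.
        return uniqueOrigins > replicationFactor / 2;
      }
    }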
List<ContainerReplica> unhealthySorted =
    deleteCandidates.stream()
        .sorted(Comparator.comparingLong(
            ContainerReplica::getSequenceId))
        .limit(requiredNodes)
        .collect(Collectors.toList());
deleteCandidates.removeAll(unhealthySorted);
Would this stream be sorted in ascending order and end up removing the lowest BCSIDs from candidates?
Right. This is now fixed and the comparator is reversed in the current code. See deleteExcessLowestBcsIDs.
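For readers following along, a minimal sketch of that reversed ordering, reusing the variables and imports from the snippet above; it is illustrative and not necessarily the exact body of deleteExcessLowestBcsIDs:

    // Keep the requiredNodes replicas with the highest BCSIDs out of the
    // delete candidates, so only the lowest-BCSID excess remains for deletion.
    List<ContainerReplica> unhealthySorted = deleteCandidates.stream()
        .sorted(Comparator.comparingLong(ContainerReplica::getSequenceId)
            .reversed())
        .limit(requiredNodes)
        .collect(Collectors.toList());
    deleteCandidates.removeAll(unhealthySorted);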
Looking at the suggestion for CLOSED containers:

This is difficult or impossible for EC. For EC we need to reconstruct the container by reading the entire contents from a quorum of other containers and generating the missing data. If all containers are legitimately unhealthy, with some sort of read problem, we are going to get errors reading some of the blocks and be unable to reconstruct the data. A partial reconstruction would likely be possible, but that comes with significant complexity too, as some blocks might be missing from one of the containers in the group, which would trip up clients trying to read them.

For now, what we have done with EC is treat an UNHEALTHY replica much like a missing one. If we have one or more, we treat the container as under replicated and try to fix it. We also exclude UNHEALTHY replicas from the over-replication handling and treat them as if they are already gone, so over-replication will not remove them. Only if the container is neither over- nor under-replicated do we remove the unhealthy replicas.

The problem we are left with is that if too many replicas are unhealthy, we cannot do a reconstruction and hence cannot fix the problem. We will not replicate them either, as we cannot really do that, and it is possible to lose some of the unhealthy replicas over time due to disk failures etc.
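A minimal, self-contained sketch of the EC policy described above. The class, enum, and simple counting model are invented for illustration and are not the actual ReplicationManager API (real EC handling works per replica index):

    // Hypothetical decision table: UNHEALTHY replicas count as missing for
    // under-replication, are ignored for over-replication, and are only
    // deleted once the container is otherwise exactly replicated.
    final class EcUnhealthyPolicySketch {
      enum Action { UNDER_REPLICATION, OVER_REPLICATION, DELETE_UNHEALTHY, NONE }

      static Action decide(int healthyReplicas, int unhealthyReplicas,
          int requiredReplicas) {
        if (healthyReplicas < requiredReplicas) {
          // Treat UNHEALTHY like missing: try reconstruction/replication first.
          return Action.UNDER_REPLICATION;
        }
        if (healthyReplicas > requiredReplicas) {
          // UNHEALTHY replicas are excluded here, so they are never picked
          // as the excess copies to remove.
          return Action.OVER_REPLICATION;
        }
        if (unhealthyReplicas > 0) {
          // Neither over- nor under-replicated: the UNHEALTHY copies are now
          // redundant and can be deleted.
          return Action.DELETE_UNHEALTHY;
        }
        return Action.NONE;
      }
    }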
* master: (110 commits)
  HDDS-7472. EC: Fix NSSummaryEndpoint#getDiskUsage for EC keys (apache#3987)
  HDDS-5704. Ozone URI syntax description in help content needs to mention about ozone service id (apache#3862)
  HDDS-7555. Upgrade Ratis to 2.4.2-8b8bdda-SNAPSHOT. (apache#4028)
  HDDS-7541. FSO recursive delete directory with hierarchy takes much time for cleanup (apache#4008)
  HDDS-7581. Fix update-jar-report for snapshot (apache#4034)
  HDDS-7253. Fix exception when '/' in key name (apache#4038)
  HDDS-7579. Use Netty 4.1.77 for consistency (apache#4031)
  HDDS-7562. Suppress warning about long filenames in tar (apache#4017)
  HDDS-7563. Add a handler for under replicated Ratis containers in RM (apache#4025)
  HDDS-7497. Fix mkdir does not update bucket's usedNamespace (apache#3969)
  HDDS-7567. Invalid entries in LICENSE (apache#4020)
  HDDS-7575. Correct showing of RATIS-THREE icon in Recon UI (apache#4026)
  HDDS-7540. Let reusable workflow inherit secrets (apache#4012)
  HDDS-7568. Bump copyright year in NOTICE (apache#4018)
  HDDS-7394. OM RPC FairCallQueue decay decision metrics list caller username in the metric (apache#3878)
  HDDS-7510. Recon: Return number of open containers in `/clusterState` endpoint (apache#3989)
  HDDS-7561. Improve setquota, clrquota CLI usage (apache#4016)
  HDDS-6615. EC: Improve write performance by pipelining encode and flush (apache#3994)
  HDDS-7554. Recon UI should show DORMANT in pipeline status filter (apache#4010)
  HDDS-7540. Separate scheduled CI from push/PR workflows (apache#4004)
  ...
RatisContainerReplicaCount is changed to not count unhealthy replicas towards its healthy count.
* HDDS-6447-take2:
  Test fixes
  Split unstable handler in legacy RM and split into common methods
siddhantsangwan left a comment
@errose28 I'm still reviewing LegacyReplicationManager but changes in other classes look okay.
// Increment report stats.
if (!sufficientlyReplicated && replicaSet.isUnrecoverable()) {
  report.incrementAndSample(HealthState.MISSING,
      container.containerID());
  report.incrementAndSample(
      HealthState.UNDER_REPLICATED, container.containerID());
  if (replicaSet.isUnrecoverable()) {
    report.incrementAndSample(HealthState.MISSING,
        container.containerID());
  }
}
if (!placementSatisfied) {
  report.incrementAndSample(HealthState.MIS_REPLICATED,
      container.containerID());
}
// Replicate container if needed.
if (!inflightReplication.isFull() || !inflightDeletion.isFull()) {
  handleUnderReplicatedContainer(container,
      replicaSet, placementStatus);
  if (!replicaSet.isUnrecoverable()) {
    if (replicaSet.getHealthyReplicaCount() == 0 &&
        replicaSet.getUnhealthyReplicaCount() != 0) {
      handleAllReplicasUnhealthy(container, replicaSet,
          placementStatus, report);
    } else {
      report.incrementAndSample(
          HealthState.UNDER_REPLICATED, container.containerID());
      handleUnderReplicatedHealthy(container,
          replicaSet, placementStatus);
    }
  }
I think this logic will not update this container's under-replicated state in the report if it's recoverable and those maps are full.
Good catch. I fixed this in fe2fd7e by having a dedicated method to generate the replication-related parts of the report. Having them mixed with the handlers for these cases was messy.
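A hedged sketch of what such a dedicated report method could look like, mirroring the calls visible in the quoted snippet; the method name is hypothetical and the actual fix in fe2fd7e may differ:

    // Update only the replication-related health states for this container,
    // independent of whether the under/over-replication handlers run.
    private void updateReplicationReport(ContainerInfo container,
        RatisContainerReplicaCount replicaSet,
        ReplicationManagerReport report) {
      if (!replicaSet.isSufficientlyReplicated()) {
        report.incrementAndSample(HealthState.UNDER_REPLICATED,
            container.containerID());
      }
      if (replicaSet.isUnrecoverable()) {
        report.incrementAndSample(HealthState.MISSING,
            container.containerID());
      }
    }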
A side effect of this change is that containers such as {Container State: CLOSED, Replicas: CLOSED, CLOSING, CLOSING} would be called under replicated. I saw that the handler will try to replicate only if closing these replicas won't achieve sufficient replication. Do you think it can be confusing to call such a container under replicated? For EC in the new RM, we're treating UNHEALTHY replicas as "not there" since they're unavailable, CLOSED replicas as candidates for replication, and replicas in other states as available but not usable for replication. In …
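As a side note, the replicate-only-if-closing-is-not-enough decision mentioned above could be expressed roughly as below. This is a standalone sketch with hypothetical names, not the handler's actual code:

    // For a CLOSED container, CLOSING replicas can simply be sent a close
    // command; copying an existing CLOSED replica is only needed if closing
    // everything closeable still leaves the container short of its target.
    final class CloseBeforeReplicateSketch {
      static boolean needsReplication(int closedReplicas, int closeableReplicas,
          int requiredReplicas) {
        return closedReplicas + closeableReplicas < requiredReplicas;
      }
    }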
Thanks for the detailed review @siddhantsangwan. I finished all previously WIP tests and they are covering more cases around closing containers with unhealthy replicas and saving BCSIDs like you mentioned.
Yes, even though the code is doing the right thing by trying to close them, we should probably not report these as under replicated. I will fix the report during my day tomorrow.
Another related point is whether …
* master: (88 commits)
  HDDS-7463. SCM Pipeline scrubber never able to cleanup allocated pipeline. (apache#4093)
  HDDS-7683. EC: ReplicationManager - UnderRep maintenance handler should not request nodes if none needed (apache#4109)
  HDDS-7635. Update failure metrics when allocate block fails in preExecute. (apache#4086)
  HDDS-7565. FSO purge directory for old bucket can update quota for new bucket (apache#4021)
  HDDS-7654. EC: ReplicationManager - merge mis-rep queue into under replicated queue (apache#4099)
  HDDS-7621. Update SCM term in datanode from heartbeat without any commands (apache#4101)
  HDDS-7649. S3 multipart upload EC release space quota wrong for old version (apache#4095)
  HDDS-7399. Enable specifying external root ca (apache#4053)
  HDDS-7398. Tool to remove old certs from the scm db (apache#3972)
  HDDS-6650. S3MultipartUpload support update bucket usedNamespace. (apache#4081)
  HDDS-7605. Improve logging in Container Balancer (apache#4067)
  HDDS-7616. EC: Refactor Unhealthy Replicated Processor (apache#4063)
  HDDS-7426. Add a new acceptance test for Streaming Pipeline. (apache#4019)
  HDDS-7478. [Ozone-Streaming] NPE in when creating a file with o3fs. (apache#3949)
  HDDS-7425. Add documentation for the new Streaming Pipeline feature. (apache#3913)
  HDDS-7438. [Ozone-Streaming] Add a createStreamKey method to OzoneBucket. (apache#3914)
  HDDS-7431. [Ozone-Streaming] Disable data steam by default. (apache#3900)
  HDDS-6955. [Ozone-streaming] Add explicit stream flag in ozone shell (apache#3559)
  HDDS-6867. [Ozone-Streaming] PutKeyHandler should not use streaming to put EC key. (apache#3516)
  HDDS-6842. [Ozone-Streaming] Reduce the number of watch requests in StreamCommitWatcher. (apache#3492)
  ...
Right. Currently I am not counting the unhealthy replicas towards sufficient replication. This was done to minimize the changes to RatisContainerReplicaCount, since this class is shared by the old and new RM. It seems this is causing test failures in the recently merged TestRatisOverReplicationHandler, though. Ultimately I think RatisContainerReplicaCount needs to be refactored to deal with unhealthy replicas more explicitly. However, I don't think we should do this while two RM implementations are using the same class for different unhealthy replica handling. I think we should leave RatisContainerReplicaCount as is in this PR and ignore the failures in TestRatisOverReplicationHandler. When unhealthy replica handling is ported to the new RM and we are ready to turn off the old RM, we can refactor RatisContainerReplicaCount.
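A minimal sketch of the counting behaviour being debated, under a simplified replica model; the real RatisContainerReplicaCount also accounts for decommission, maintenance, and in-flight adds/deletes, and the exact State import shown here is an assumption:

    import java.util.List;
    import org.apache.hadoop.hdds.protocol.proto
        .StorageContainerDatanodeProtocolProtos.ContainerReplicaProto.State;
    import org.apache.hadoop.hdds.scm.container.ContainerReplica;

    final class HealthyCountSketch {
      // UNHEALTHY replicas are not counted towards the healthy total, so a
      // container whose only copies are UNHEALTHY looks under replicated
      // (or missing) rather than sufficiently replicated.
      static int healthyReplicaCount(List<ContainerReplica> replicas) {
        return (int) replicas.stream()
            .filter(r -> r.getState() != State.UNHEALTHY)
            .count();
      }
    }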
* HDDS-6447-fix-stats:
  Finish report updates and minor test fixes
  Initial report updates marked
My concern is that it will affect all classes that use the …
This should preserve existing functionality for the new RM
@siddhantsangwan I managed to preserve the original functionality of …
@errose28 The final changes look great. Thanks for working on this!
What changes were proposed in this pull request?
This PR continues Hanisha's work from #3258, although it makes some changes to the rules proposed there.
Updates how replication manager (RM) deals with quasi closed and unhealthy replicas for Ratis containers only. Currently all unhealthy containers are deleted. It is possible that the unhealthy container still has mostly good data, just with a few corrupted blocks, and that we will have the ability to recover unhealthy containers in the future. For this reason, this PR proposes changing the replication manager to abide by the following rules:
If the container is closed:
If the container is not yet closed:
What is the link to the Apache JIRA
HDDS-6447
How was this patch tested?
Tests were added and updated in TestLegacyReplicationManager. To aid in reviewing, tests in that class were grouped into nested classes based on functionality. Tests in the UnstableReplicas class concern these changes, and all should be reviewed for expected behavior even if they do not show up in the diff. Tests in other classes were relocated with slight modification as necessary. For reviewers of this file I would recommend verifying that all tests in the original TestLegacyReplicationManager are still present and passing, as the diff for this refactor is quite messy.
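To illustrate the grouping described above, a JUnit 5 style outline is sketched below; the nested class layout and test names are assumptions about the structure rather than a copy of the real test file:

    import org.junit.jupiter.api.Nested;
    import org.junit.jupiter.api.Test;

    class TestLegacyReplicationManagerOutline {
      @Nested
      class UnstableReplicas {
        // Tests covering QUASI_CLOSED/UNHEALTHY replica handling live here.
        @Test
        void closedContainerWithOnlyUnhealthyReplicasIsNotDeleted() {
          // Body omitted in this sketch.
        }
      }

      @Nested
      class OverReplication {
        // Existing over-replication tests, relocated from the flat class.
      }
    }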