HDDS-6447. Refine SCM handling of unhealthy container replicas. #3920
Conversation
* master: (718 commits)
  HDDS-7342. Move encryption-related code from MultipartCryptoKeyInputStream to OzoneCryptoInputStream (apache#3852)
  HDDS-7413. Fix logging while marking container state unhealthy (apache#3887)
  Revert "HDDS-7253. Fix exception when '/' in key name (apache#3774)"
  HDDS-7396. Force close non-RATIS containers in ReplicationManager (apache#3877)
  HDDS-7121. Support namespace summaries (du, dist & counts) for legacy FS buckets (apache#3746)
  HDDS-7258. Cleanup the allocated but uncommitted blocks (apache#3778)
  HDDS-7381. Cleanup of VolumeManagerImpl (apache#3873)
  HDDS-7253. Fix exception when '/' in key name (apache#3774)
  HDDS-7182. Add property to control RocksDB max open files (apache#3843)
  HDDS-7284. JVM crash for rocksdb for read/write after close (apache#3801)
  HDDS-7368. [Multi-Tenant] Add Volume Existence check in preExecute for OMTenantCreateRequest (apache#3869)
  HDDS-7403. README Security Improvement (apache#3879)
  HDDS-7199. Implement new mix workload Read/Write Freon command (apache#3872)
  HDDS-7248. Recon: Expand the container status page to show all unhealthy container states (apache#3837)
  HDDS-7141. Recon: Improve Disk Usage Page (apache#3789)
  HDDS-7369. Fix wrong order of command arguments in Nonrolling-Upgrade.md (apache#3866)
  HDDS-6210. EC: Add EC metrics (apache#3851)
  HDDS-7355. non-primordial scm fail to get signed cert from primordial SCM when converting an unsecure cluster to secure (apache#3859)
  HDDS-7356. Update SCM-HA.zh.md to match the English version (apache#3861)
  HDDS-6930. SCM,OM,RECON should not print ERROR and exit with code 1 on successful shutdown (apache#3848)
  ...

Conflicts:
  hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java
  hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestLegacyReplicationManager.java
- Add new tests
- Group test by category
Hi @adoroszlai @sodonnel @nandakumar131, if you have time I would appreciate some feedback on the RM changes proposed in the description here before review of the code starts.
    .limit(requiredNodes)
    .collect(Collectors.toList());
deleteCandidates.removeAll(unhealthySorted);
} else {
This can only be QUASI_CLOSED containers, which cannot be force closed (not enough unique origin IDs), right?
This part of the code has changed since this review. Sorry I did not address it at the time, but I encourage you to check out the new version and post any remaining questions.
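For context, here is a rough sketch of the kind of origin-ID check the comment above alludes to. The class name, method name, and the exact quorum rule are assumptions for illustration only, not the actual LegacyReplicationManager code:

    import java.util.Set;
    import org.apache.hadoop.hdds.scm.container.ContainerReplica;

    // Hypothetical helper: a QUASI_CLOSED container can only be force closed
    // once replicas from enough unique origin datanodes have been observed.
    final class ForceCloseCheckSketch {
      static boolean canForceClose(Set<ContainerReplica> replicas,
          int replicationFactor) {
        long uniqueOrigins = replicas.stream()
            .map(ContainerReplica::getOriginDatanodeId)
            .distinct()
            .count();
        // Assumed majority quorum, e.g. 2 of 3 for RATIS/THREE.
        return uniqueOrigins > replicationFactor / 2;
      }
    }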
List<ContainerReplica> unhealthySorted =
    deleteCandidates.stream()
        .sorted(Comparator.comparingLong(
            ContainerReplica::getSequenceId))
        .limit(requiredNodes)
        .collect(Collectors.toList());
deleteCandidates.removeAll(unhealthySorted);
Would this stream be sorted in ascending order and end up removing the lowest BCSIDs from candidates?
Right. This is now fixed and the comparator is reversed in the current code. See deleteExcessLowestBcsIDs.
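For readers following along, a minimal sketch of that reversed ordering, reusing the variables and imports from the snippet above; it is illustrative and not necessarily the exact body of deleteExcessLowestBcsIDs:

    // Keep the requiredNodes replicas with the highest BCSIDs out of the
    // delete candidates, so only the lowest-BCSID excess remains for deletion.
    List<ContainerReplica> unhealthySorted = deleteCandidates.stream()
        .sorted(Comparator.comparingLong(ContainerReplica::getSequenceId)
            .reversed())
        .limit(requiredNodes)
        .collect(Collectors.toList());
    deleteCandidates.removeAll(unhealthySorted);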
Looking at the suggestion for CLOSED containers:

This is difficult or impossible for EC. For EC we need to reconstruct the container by reading the entire contents from a quorum of other containers and generating the missing data. If all containers are legitimately unhealthy, with some sort of read problem, we are going to get errors reading some of the blocks and be unable to reconstruct the data. A partial reconstruction would likely be possible, but that comes with significant complexity too, as some blocks might be missing from one of the containers in the group, which would trip up clients trying to read them.

For now, what we have done with EC is treat an UNHEALTHY replica much like a missing one. If we have one or more, we treat the container as under replicated and try to fix it. We also exclude UNHEALTHY replicas from the over-replication handling and treat them as if they are already gone, so over-replication will not remove them. Only if the container is neither over- nor under-replicated do we remove the unhealthy replicas.

The problem we are left with is that if too many replicas are unhealthy, we cannot do a reconstruction and hence cannot fix the problem. We will not replicate them either, as we cannot really do that, and it is possible to lose some of the unhealthy replicas over time due to disk failures etc.
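A minimal, self-contained sketch of the EC policy described above. The class, enum, and simple counting model are invented for illustration and are not the actual ReplicationManager API (real EC handling works per replica index):

    // Hypothetical decision table: UNHEALTHY replicas count as missing for
    // under-replication, are ignored for over-replication, and are only
    // deleted once the container is otherwise exactly replicated.
    final class EcUnhealthyPolicySketch {
      enum Action { UNDER_REPLICATION, OVER_REPLICATION, DELETE_UNHEALTHY, NONE }

      static Action decide(int healthyReplicas, int unhealthyReplicas,
          int requiredReplicas) {
        if (healthyReplicas < requiredReplicas) {
          // Treat UNHEALTHY like missing: try reconstruction/replication first.
          return Action.UNDER_REPLICATION;
        }
        if (healthyReplicas > requiredReplicas) {
          // UNHEALTHY replicas are excluded here, so they are never picked
          // as the excess copies to remove.
          return Action.OVER_REPLICATION;
        }
        if (unhealthyReplicas > 0) {
          // Neither over- nor under-replicated: the UNHEALTHY copies are now
          // redundant and can be deleted.
          return Action.DELETE_UNHEALTHY;
        }
        return Action.NONE;
      }
    }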
* master: (110 commits)
  HDDS-7472. EC: Fix NSSummaryEndpoint#getDiskUsage for EC keys (apache#3987)
  HDDS-5704. Ozone URI syntax description in help content needs to mention about ozone service id (apache#3862)
  HDDS-7555. Upgrade Ratis to 2.4.2-8b8bdda-SNAPSHOT. (apache#4028)
  HDDS-7541. FSO recursive delete directory with hierarchy takes much time for cleanup (apache#4008)
  HDDS-7581. Fix update-jar-report for snapshot (apache#4034)
  HDDS-7253. Fix exception when '/' in key name (apache#4038)
  HDDS-7579. Use Netty 4.1.77 for consistency (apache#4031)
  HDDS-7562. Suppress warning about long filenames in tar (apache#4017)
  HDDS-7563. Add a handler for under replicated Ratis containers in RM (apache#4025)
  HDDS-7497. Fix mkdir does not update bucket's usedNamespace (apache#3969)
  HDDS-7567. Invalid entries in LICENSE (apache#4020)
  HDDS-7575. Correct showing of RATIS-THREE icon in Recon UI (apache#4026)
  HDDS-7540. Let reusable workflow inherit secrets (apache#4012)
  HDDS-7568. Bump copyright year in NOTICE (apache#4018)
  HDDS-7394. OM RPC FairCallQueue decay decision metrics list caller username in the metric (apache#3878)
  HDDS-7510. Recon: Return number of open containers in `/clusterState` endpoint (apache#3989)
  HDDS-7561. Improve setquota, clrquota CLI usage (apache#4016)
  HDDS-6615. EC: Improve write performance by pipelining encode and flush (apache#3994)
  HDDS-7554. Recon UI should show DORMANT in pipeline status filter (apache#4010)
  HDDS-7540. Separate scheduled CI from push/PR workflows (apache#4004)
  ...
RatisContainerReplicaCount is changed to not count unhealthy replicas towards its healthy count.
* HDDS-6447-take2:
  Test fixes
  Split unstable handler in legacy RM and split into common methods
siddhantsangwan left a comment
@errose28 I'm still reviewing LegacyReplicationManager but changes in other classes look okay.
// Increment report stats.
if (!sufficientlyReplicated && replicaSet.isUnrecoverable()) {
  report.incrementAndSample(HealthState.MISSING,
      container.containerID());
  report.incrementAndSample(
      HealthState.UNDER_REPLICATED, container.containerID());
  if (replicaSet.isUnrecoverable()) {
    report.incrementAndSample(HealthState.MISSING,
        container.containerID());
  }
}
if (!placementSatisfied) {
  report.incrementAndSample(HealthState.MIS_REPLICATED,
      container.containerID());
}
// Replicate container if needed.
if (!inflightReplication.isFull() || !inflightDeletion.isFull()) {
  handleUnderReplicatedContainer(container,
      replicaSet, placementStatus);
  if (!replicaSet.isUnrecoverable()) {
    if (replicaSet.getHealthyReplicaCount() == 0 &&
        replicaSet.getUnhealthyReplicaCount() != 0) {
      handleAllReplicasUnhealthy(container, replicaSet,
          placementStatus, report);
    } else {
      report.incrementAndSample(
          HealthState.UNDER_REPLICATED, container.containerID());
      handleUnderReplicatedHealthy(container,
          replicaSet, placementStatus);
    }
  }
I think this logic will not update this container's under-replicated state in the report if it's recoverable and those maps are full.
Good catch. I fixed this in fe2fd7e by having a dedicated method to generate the replication-related parts of the report. Having them mixed with the handlers for these cases was messy.
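A hedged sketch of what such a dedicated report method could look like, mirroring the calls visible in the quoted snippet; the method name is hypothetical and the actual fix in fe2fd7e may differ:

    // Update only the replication-related health states for this container,
    // independent of whether the under/over-replication handlers run.
    private void updateReplicationReport(ContainerInfo container,
        RatisContainerReplicaCount replicaSet,
        ReplicationManagerReport report) {
      if (!replicaSet.isSufficientlyReplicated()) {
        report.incrementAndSample(HealthState.UNDER_REPLICATED,
            container.containerID());
      }
      if (replicaSet.isUnrecoverable()) {
        report.incrementAndSample(HealthState.MISSING,
            container.containerID());
      }
    }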
A side effect of this change is that containers such as {Container State: CLOSED, Replicas: CLOSED, CLOSING, CLOSING} would be called under replicated. I saw that the handler will try to replicate only if closing these replicas won't achieve sufficient replication. Do you think it can be confusing to call such a container under replicated? For EC in the new RM, we're treating UNHEALTHY replicas as "not there" since they're unavailable, CLOSED replicas as candidates for replication, and replicas in other states as available but not usable for replication. In …
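As a side note, the replicate-only-if-closing-is-not-enough decision mentioned above could be expressed roughly as below. This is a standalone sketch with hypothetical names, not the handler's actual code:

    // For a CLOSED container, CLOSING replicas can simply be sent a close
    // command; copying an existing CLOSED replica is only needed if closing
    // everything closeable still leaves the container short of its target.
    final class CloseBeforeReplicateSketch {
      static boolean needsReplication(int closedReplicas, int closeableReplicas,
          int requiredReplicas) {
        return closedReplicas + closeableReplicas < requiredReplicas;
      }
    }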
Thanks for the detailed review @siddhantsangwan. I finished all previously WIP tests and they are covering more cases around closing containers with unhealthy replicas and saving BCSIDs like you mentioned.
Yes, even though the code is doing the right thing by trying to close them, we should probably not report these as under replicated. I will fix the report during my day tomorrow.
Another related point is whether …
* master: (88 commits)
  HDDS-7463. SCM Pipeline scrubber never able to cleanup allocated pipeline. (apache#4093)
  HDDS-7683. EC: ReplicationManager - UnderRep maintenance handler should not request nodes if none needed (apache#4109)
  HDDS-7635. Update failure metrics when allocate block fails in preExecute. (apache#4086)
  HDDS-7565. FSO purge directory for old bucket can update quota for new bucket (apache#4021)
  HDDS-7654. EC: ReplicationManager - merge mis-rep queue into under replicated queue (apache#4099)
  HDDS-7621. Update SCM term in datanode from heartbeat without any commands (apache#4101)
  HDDS-7649. S3 multipart upload EC release space quota wrong for old version (apache#4095)
  HDDS-7399. Enable specifying external root ca (apache#4053)
  HDDS-7398. Tool to remove old certs from the scm db (apache#3972)
  HDDS-6650. S3MultipartUpload support update bucket usedNamespace. (apache#4081)
  HDDS-7605. Improve logging in Container Balancer (apache#4067)
  HDDS-7616. EC: Refactor Unhealthy Replicated Processor (apache#4063)
  HDDS-7426. Add a new acceptance test for Streaming Pipeline. (apache#4019)
  HDDS-7478. [Ozone-Streaming] NPE in when creating a file with o3fs. (apache#3949)
  HDDS-7425. Add documentation for the new Streaming Pipeline feature. (apache#3913)
  HDDS-7438. [Ozone-Streaming] Add a createStreamKey method to OzoneBucket. (apache#3914)
  HDDS-7431. [Ozone-Streaming] Disable data steam by default. (apache#3900)
  HDDS-6955. [Ozone-streaming] Add explicit stream flag in ozone shell (apache#3559)
  HDDS-6867. [Ozone-Streaming] PutKeyHandler should not use streaming to put EC key. (apache#3516)
  HDDS-6842. [Ozone-Streaming] Reduce the number of watch requests in StreamCommitWatcher. (apache#3492)
  ...
Right. Currently I am not counting the unhealthy replicas towards sufficient replication. This was done to minimize the changes to RatisContainerReplicaCount, since this class is shared by the old and new RM. It seems this is causing test failures in the recently merged TestRatisOverReplicationHandler, though. Ultimately I think RatisContainerReplicaCount needs to be refactored to deal with unhealthy replicas more explicitly. However, I don't think we should do this while two RM implementations are using the same class for different unhealthy replica handling. I think we should leave RatisContainerReplicaCount as is in this PR and ignore the failures in TestRatisOverReplicationHandler. When unhealthy replica handling is ported to the new RM and we are ready to turn off the old RM, we can refactor RatisContainerReplicaCount.
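A minimal sketch of the counting behaviour being debated, under a simplified replica model; the real RatisContainerReplicaCount also accounts for decommission, maintenance, and in-flight adds/deletes, and the exact State import shown here is an assumption:

    import java.util.List;
    import org.apache.hadoop.hdds.protocol.proto
        .StorageContainerDatanodeProtocolProtos.ContainerReplicaProto.State;
    import org.apache.hadoop.hdds.scm.container.ContainerReplica;

    final class HealthyCountSketch {
      // UNHEALTHY replicas are not counted towards the healthy total, so a
      // container whose only copies are UNHEALTHY looks under replicated
      // (or missing) rather than sufficiently replicated.
      static int healthyReplicaCount(List<ContainerReplica> replicas) {
        return (int) replicas.stream()
            .filter(r -> r.getState() != State.UNHEALTHY)
            .count();
      }
    }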
* HDDS-6447-fix-stats:
  Finish report updates and minor test fixes
  Initial report updates marked
My concern is that it will affect all classes that use the …
This should preserve existing functionality for the new RM
@siddhantsangwan I managed to preserve the original functionality of …
@errose28 The final changes look great. Thanks for working on this!
What changes were proposed in this pull request?
This PR continues Hanisha's work from #3258, although it makes some changes to the rules proposed there.
Updates how replication manager (RM) deals with quasi closed and unhealthy replicas for Ratis containers only. Currently all unhealthy containers are deleted. It is possible that the unhealthy container still has mostly good data, just with a few corrupted blocks, and that we will have the ability to recover unhealthy containers in the future. For this reason, this PR proposes changing the replication manager to abide by the following rules:
If the container is closed:
If the container is not yet closed:
What is the link to the Apache JIRA
HDDS-6447
How was this patch tested?
Tests were added and updated in TestLegacyReplicationManager. To aid in reviewing, tests in that class were grouped into nested classes based on functionality. Tests in the UnstableReplicas class concern these changes, and all should be reviewed for expected behavior even if they do not show up in the diff. Tests in other classes were relocated with slight modification as necessary. For reviewers of this file I would recommend verifying that all tests in the original TestLegacyReplicationManager are still present and passing, as the diff for this refactor is quite messy.
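To illustrate the grouping described above, a JUnit 5 style outline is sketched below; the nested class layout and test names are assumptions about the structure rather than a copy of the real test file:

    import org.junit.jupiter.api.Nested;
    import org.junit.jupiter.api.Test;

    class TestLegacyReplicationManagerOutline {
      @Nested
      class UnstableReplicas {
        // Tests covering QUASI_CLOSED/UNHEALTHY replica handling live here.
        @Test
        void closedContainerWithOnlyUnhealthyReplicasIsNotDeleted() {
          // Body omitted in this sketch.
        }
      }

      @Nested
      class OverReplication {
        // Existing over-replication tests, relocated from the flat class.
      }
    }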