HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. #1625
|
@GlenGeng, I don't fully understand the reason for doing this. If we enable replication for containers in the OPEN state, how do we keep data consistent when a container is being replicated while blocks are still being written to it? |
|
@GlenGeng , is the description of the pull request correct? |
|
@linyiqun @bshashikant Thanks for the review! Shashi, I found this issue while debugging the test case TestDeleteWithSlowFollower for SCM HA, which you wrote. You can find more details in the JIRA description. The key point is, although this case can pass in master, after This case relies on RM to move the container to the CLOSING state, so that SCM can then move it to CLOSED and stop holding back the delete-blocks command. There are two ways to avoid the stale node issue: one is to shorten the RM interval in TestDeleteWithSlowFollower, the other is the change above. Actually, this fix would work on the Oct 20 version, which is the latest merge from master into HDDS-2823, but it seems some change after Oct 20 affects TestDeleteWithSlowFollower again. |
|
The description may not be accurate enough; I will update it later. Thanks for pointing that out! |
|
@linyiqun Hey yiqun, never mind, RM only sends the replicate-container command for closed containers.
Okay, got it. |
|
@ChenSammi Please take a look at this PR. Thanks ! |
|
The patch LGTM, +1. Thanks @GlenGeng for investigating and fixing the issue, and @linyiqun @bshashikant for the review. |
* HDDS-3698-upgrade:
  HDDS-4429. Create unit test for SimpleContainerDownloader. (apache#1551)
  HDDS-4461. Reuse compiled binaries in acceptance test (apache#1588)
  HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. (apache#1625)
  HDDS-4510. SCM can avoid creating RetriableDatanodeEventWatcher for deletion command ACK (apache#1626)
  HDDS-3363. Intermittent failure in testContainerImportExport (apache#1618)
  HDDS-4370. Datanode deletion service can avoid storing deleted blocks. (apache#1620)
  HDDS-4512. Remove unused netty3 transitive dependency (apache#1627)
  HDDS-4481. With HA OM can send deletion blocks to SCM multiple times. (apache#1608)
  HDDS-4487. SCM can avoid using RETRIABLE_DATANODE_COMMAND for datanode deletion commands. (apache#1621)
  HDDS-4471. GrpcOutputStream length can overflow (apache#1617)
  HDDS-4308. Fix issue with quota update (apache#1489)
  HDDS-4392. [DOC] Add Recon architecture to docs (apache#1602)
  HDDS-4501. Reload OM State fail should terminate OM for any exceptions. (apache#1622)
  HDDS-4492. CLI flag --quota should default to 'spaceQuota' to preserve backward compatibility. (apache#1609)
  HDDS-3689. Add various profiles to MiniOzoneChaosCluster to run different modes. (apache#1420)
  HDDS-4497. Recon File Size Count task throws SQL Exception. (apache#1612)
* master: (40 commits)
  HDDS-4473. Reduce number of sortDatanodes RPC calls (apache#1610)
  HDDS-4485. [DOC] add the authentication rules of the Ozone Ranger. (apache#1603)
  HDDS-4528. Upgrade slf4j to 1.7.30 (apache#1639)
  HDDS-4424. Update README with information how to report security issues (apache#1548)
  HDDS-4484. Use RaftServerImpl isLeader instead of periodic leader update logic in OM and isLeaderReady for read/write requests (apache#1638)
  HDDS-4429. Create unit test for SimpleContainerDownloader. (apache#1551)
  HDDS-4461. Reuse compiled binaries in acceptance test (apache#1588)
  HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. (apache#1625)
  HDDS-4510. SCM can avoid creating RetriableDatanodeEventWatcher for deletion command ACK (apache#1626)
  HDDS-3363. Intermittent failure in testContainerImportExport (apache#1618)
  HDDS-4370. Datanode deletion service can avoid storing deleted blocks. (apache#1620)
  HDDS-4512. Remove unused netty3 transitive dependency (apache#1627)
  HDDS-4481. With HA OM can send deletion blocks to SCM multiple times. (apache#1608)
  HDDS-4487. SCM can avoid using RETRIABLE_DATANODE_COMMAND for datanode deletion commands. (apache#1621)
  HDDS-4471. GrpcOutputStream length can overflow (apache#1617)
  HDDS-4308. Fix issue with quota update (apache#1489)
  HDDS-4392. [DOC] Add Recon architecture to docs (apache#1602)
  HDDS-4501. Reload OM State fail should terminate OM for any exceptions. (apache#1622)
  HDDS-4492. CLI flag --quota should default to 'spaceQuota' to preserve backward compatibility. (apache#1609)
  HDDS-3689. Add various profiles to MiniOzoneChaosCluster to run different modes. (apache#1420)
  ...
What changes were proposed in this pull request?
This change is motivated by the need to keep StaleNodeHandler from taking effect in TestDeleteWithSlowFollower.
Consider the following logic in RM:
Consider this scenario: a client triggers SCM to allocate a container by allocating blocks, then crashes before writing any chunks to a DN. The container is therefore never created on a datanode, and no replica report is ever sent for it.
Previously, ReplicationManager would close such containers because they are under-replicated. This is reasonable legacy handling, and it also prevents StaleNodeHandler from taking effect in TestDeleteWithSlowFollower.
The following logic was added by Sammi in her PR
HDDS-4023. Delete closed container after all blocks have been deleted
After talking with @ChenSammi: by design, it only needs to explicitly avoid replicating containers in the DELETING or DELETED state.
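The intended behaviour described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the actual ReplicationManager code: the class `ReplicationDecisionSketch`, the `State` and `Action` enums, and the `decide` method are hypothetical names, the state list is simplified, and "under-replicated" is reduced to a plain replica count.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical, simplified model of the decision discussed in this PR. */
class ReplicationDecisionSketch {

  // Simplified container lifecycle states (assumption: the real enum has more values).
  enum State { OPEN, CLOSING, CLOSED, DELETING, DELETED }

  // What the sketch decides to do with a container.
  enum Action { NONE, CLOSE, REPLICATE }

  // States that must never trigger replication, per the discussion with @ChenSammi.
  private static final Set<State> BEING_DELETED =
      EnumSet.of(State.DELETING, State.DELETED);

  static Action decide(State state, int reportedReplicas, int requiredReplicas) {
    // Explicitly skip containers that are being deleted or are already deleted.
    if (BEING_DELETED.contains(state)) {
      return Action.NONE;
    }
    boolean underReplicated = reportedReplicas < requiredReplicas;
    // An OPEN container with missing replicas (for example, one that was
    // allocated but never written, so no replica was ever reported) is closed.
    if (state == State.OPEN) {
      return underReplicated ? Action.CLOSE : Action.NONE;
    }
    // Replicate commands are only sent for CLOSED containers.
    if (state == State.CLOSED && underReplicated) {
      return Action.REPLICATE;
    }
    // Simplification: CLOSING containers are left alone in this sketch.
    return Action.NONE;
  }

  public static void main(String[] args) {
    System.out.println(decide(State.OPEN, 0, 3));     // CLOSE
    System.out.println(decide(State.CLOSED, 2, 3));   // REPLICATE
    System.out.println(decide(State.DELETING, 0, 3)); // NONE
  }
}
```

The point of the ordering is that the DELETING/DELETED check comes first, so the legacy "close under-replicated open containers" path stays reachable for the allocated-but-never-written scenario.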
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-4511
How was this patch tested?
CI