HDDS-6982. EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. #3583
JacksonYao287
Thanks @umamaheswararao for the patch; the changes overall look good.
So now we have two ways to clean up stale recovering containers:
1. If the coordinator encounters any exception while reconstructing the EC container group, it will try to delete all the recovering containers. This is the work of this patch.
2. When a datanode starts up, it will scan all the containers and delete the recovering ones. This will be done in HDDS-6978.
So, a question: what if the coordinator crashes while trying to delete the recovering containers on the other target datanodes, and those target datanodes never restart? There will be no trigger to complete this task.
Should we have some other approach to make sure those recovering containers are deleted eventually? Maybe we can assign this work to containerDataScanner. Then another question arises: for a given recovering container, how does the datanode itself know whether the container is still being written by a coordinator, or is just a stale recovering container whose coordinator was lost?
Hi @JacksonYao287, thank you for the comments. I think we can go with a longer timeout. Let's say 20 minutes? If reconstruction does not complete within 20 minutes, I would consider that the reconstruction task is either very slow or has failed intermittently. Let's keep that as a dev-tunable parameter; as we learn how much time is needed, we can update it accordingly. But let's not expose it to users, to avoid too many internal configs.
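For illustration only, the timeout idea in this comment could be sketched roughly as below. This is not code from the patch: `RecoveringContainerTracker` and its method names are hypothetical, and the sketch assumes the datanode records when each RECOVERING container was created. Only the 20-minute value comes from the discussion.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch (not an Ozone class): tracks when each RECOVERING
 * container was created and reports the ones older than a timeout.
 */
public class RecoveringContainerTracker {
  // Dev-tunable timeout, deliberately not exposed as a user config.
  private final Duration timeout;
  // containerId -> time the RECOVERING container was created.
  private final Map<Long, Instant> createdAt = new ConcurrentHashMap<>();

  public RecoveringContainerTracker(Duration timeout) {
    this.timeout = timeout;
  }

  /** Called when a coordinator creates a RECOVERING container here. */
  public void onRecoveringCreated(long containerId, Instant now) {
    createdAt.put(containerId, now);
  }

  /** Called when reconstruction completes or the container is deleted. */
  public void onRecoveryFinished(long containerId) {
    createdAt.remove(containerId);
  }

  /** Returns the IDs of containers that have been RECOVERING longer than the timeout. */
  public List<Long> findStale(Instant now) {
    List<Long> stale = new ArrayList<>();
    for (Map.Entry<Long, Instant> e : createdAt.entrySet()) {
      if (Duration.between(e.getValue(), now).compareTo(timeout) > 0) {
        stale.add(e.getKey());
      }
    }
    return stale;
  }
}
```

A periodic scanner on the datanode could call `findStale(Instant.now())` and delete whatever it returns. With a timeout like this the datanode does not need to know whether a coordinator is still alive; it only needs to know how long the container has been in the RECOVERING state.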
JacksonYao287
LGTM +1!
Oops, I had a test for this one but forgot to add it to this PR. Anyway, since we got a +1, I went ahead and committed, and filed a follow-up for adding the test: HDDS-6989. Thanks a lot @JacksonYao287 for the reviews!
* master: (46 commits)
  HDDS-6901. Configure HDDS volume reserved as percentage of the volume space. (apache#3532)
  HDDS-6978. EC: Cleanup RECOVERING container on DN restarts (apache#3585)
  HDDS-6982. EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. (apache#3583)
  HDDS-6968. Addendum: [Multi-Tenant] Fix USER_MISMATCH error even on correct user. (apache#3578)
  HDDS-6794. EC: Analyze and add putBlock even on non writing node in the case of partial single stripe. (apache#3514)
  HDDS-6900. Propagate TimeoutException for all SCM HA Ratis calls. (apache#3564)
  HDDS-6938. handle NPE when removing prefixAcl (apache#3568)
  HDDS-6960. EC: Implement the Over-replication Handler (apache#3572)
  HDDS-6979. Remove unused plexus dependency declaration (apache#3579)
  HDDS-6957. EC: ReplicationManager - priortise under replicated containers (apache#3574)
  HDDS-6723. Close Rocks objects properly in OzoneManager (apache#3400)
  HDDS-6942. Ozone Buckets/Objects created via S3 should not allow group access (apache#3553)
  HDDS-6965. Increase timeout for basic check (apache#3563)
  HDDS-6969. Add link to compose directory in smoketest README (apache#3567)
  HDDS-6970. EC: Ensure DatanodeAdminMonitor can handle EC containers during decommission (apache#3573)
  HDDS-6977. EC: Remove references to ContainerReplicaPendingOps in TestECContainerReplicaCount (apache#3575)
  HDDS-6217. Cleanup XceiverClientGrpc TODOs, and document how the client works and should be used. (apache#3012)
  HDDS-6773. Cleanup TestRDBTableStore (apache#3434) - fix checkstyle
  HDDS-6773. Cleanup TestRDBTableStore (apache#3434)
  HDDS-6676. KeyValueContainerData#getProtoBufMessage() should set block count (apache#3371)
  ...
Conflicts:
  hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/SCMUpgradeFinalizer.java
HDDS-6982. EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. (apache#3583)
(cherry picked from commit d088d19)
Change-Id: I8546c1ad4088f7ddfde3257d78342396146c4d2e
What changes were proposed in this pull request?
RECOVERING containers created by the coordinator are deleted if reconstruction fails.
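As a rough illustration of the behaviour described here (not the code from this patch), a best-effort cleanup on failure might look like the sketch below. `TargetClient`, `Reconstruction`, and `reconstruct` are hypothetical names introduced for this example; the real coordinator and datanode client APIs are assumed to differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of coordinator-side cleanup; not Ozone's actual API. */
public class ReconstructionCleanupSketch {

  /** Hypothetical client for one target datanode. */
  public interface TargetClient {
    void createRecoveringContainer(long containerId) throws Exception;
    void deleteRecoveringContainer(long containerId) throws Exception;
  }

  /** Hypothetical stand-in for the actual read, decode, and write work. */
  public interface Reconstruction {
    void run() throws Exception;
  }

  public static void reconstruct(long containerId, List<TargetClient> targets,
      Reconstruction work) throws Exception {
    // Remember which targets actually got a RECOVERING container.
    List<TargetClient> created = new ArrayList<>();
    try {
      for (TargetClient target : targets) {
        target.createRecoveringContainer(containerId);
        created.add(target);
      }
      work.run();
    } catch (Exception e) {
      // Best effort: attempt deletion on every target, even if one delete fails.
      for (TargetClient target : created) {
        try {
          target.deleteRecoveringContainer(containerId);
        } catch (Exception cleanupFailure) {
          // Keep the original failure as the primary error.
          e.addSuppressed(cleanupFailure);
        }
      }
      throw e;
    }
  }
}
```

The cleanup is only an attempt: if a delete call fails, or the coordinator itself crashes mid-cleanup, the RECOVERING container stays on the target datanode, which is the gap raised in the review discussion.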
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-6982
How was this patch tested?
Will be adding tests.