
Conversation

@umamaheswararao
Contributor

What changes were proposed in this pull request?

Recovering containers created by the coordinator will be deleted if reconstruction fails.
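For readers skimming the change, here is a minimal sketch of its shape; every name below is an illustrative assumption, not the actual Ozone class or method:

```java
import java.io.IOException;

// Illustrative sketch only: hypothetical names, not the real
// Ozone reconstruction coordinator.
public class RecoveringContainerCleanupSketch {

  public void reconstructECContainerGroup(long containerId)
      throws IOException {
    try {
      // Create RECOVERING containers on the target datanodes and
      // stream the reconstructed chunks into them.
      reconstruct(containerId);
    } catch (Exception e) {
      // Best-effort cleanup before surfacing the failure; any target
      // we cannot reach keeps its stale RECOVERING container (that
      // gap is covered by HDDS-6978 and the discussion below).
      tryDeleteRecoveringContainers(containerId);
      throw new IOException(
          "EC reconstruction failed for container " + containerId, e);
    }
  }

  private void reconstruct(long containerId) throws IOException {
    // Placeholder for the actual reconstruction work.
  }

  private void tryDeleteRecoveringContainers(long containerId) {
    // Placeholder: send a delete-container command to each target DN,
    // swallowing per-target errors so cleanup stays best effort.
  }
}
```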

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6982

How was this patch tested?

Will be adding tests.

@umamaheswararao umamaheswararao changed the title HDDS-6982: EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. HDDS-6982. EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. Jul 7, 2022
@JacksonYao287 (Contributor) left a comment


Thanks @umamaheswararao for the patch, the changes overall look good.
So now we have two ways to clean up stale recovering containers:
1. If the coordinator encounters any exception while reconstructing the EC container group, it will try to delete all the recovering containers. That is the work of this patch.
2. When a datanode starts up, it will scan all the containers and delete the recovering ones. This will be done in HDDS-6978 (see the sketch below).
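(A hypothetical sketch of mechanism 2, HDDS-6978, with made-up types rather than the real datanode code: a restart means no reconstruction can still be in flight, so any container found in RECOVERING state is safe to drop.)

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical types only, not the real datanode container set.
class StartupRecoveringScanSketch {
  enum ContainerState { OPEN, CLOSED, RECOVERING }

  static final class ContainerInfo {
    final long id;
    final ContainerState state;

    ContainerInfo(long id, ContainerState state) {
      this.id = id;
      this.state = state;
    }
  }

  static void cleanupOnStartup(List<ContainerInfo> containers) {
    Iterator<ContainerInfo> it = containers.iterator();
    while (it.hasNext()) {
      if (it.next().state == ContainerState.RECOVERING) {
        // A restart interrupted reconstruction; the coordinator will
        // re-create the container if it retries, so the partial data
        // can be deleted here.
        it.remove();
      }
    }
  }
}
```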

So, a question: what if the coordinator crashes while trying to delete the recovering containers on the other target datanodes, and those target datanodes never restart? Then there will be no trigger to complete this task.

Should we have some other approach to make sure those recovering containers are ultimately deleted? Maybe we can assign this work to the containerDataScanner. But then another question comes: for a given recovering container, how does the datanode itself know whether the container is still being written by a coordinator, or is just a stale recovering container with a lost coordinator?

@umamaheswararao
Contributor Author

Hi @JacksonYao287, thank you for the comments.
Yeah, we have discussed this point. I have just filed a JIRA for it: https://issues.apache.org/jira/browse/HDDS-6987 (please see the description in that JIRA).

I think we can go with some longer timeout, say 20 minutes? If reconstruction does not finish within 20 minutes, I would consider that the reconstruction task is hitting some really slow path or has failed intermittently. Let's keep that as a dev-tunable param, and as we learn the time actually needed, we can update it accordingly. But let's not expose it to users, to avoid too many internal configs.
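To make the idea concrete, a minimal sketch of such a timeout check; the helper name is hypothetical and the 20-minute default is just the figure floated above, kept as a dev-tunable rather than a user-facing config (the real work is tracked in HDDS-6987):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper; RECOVERING_TIMEOUT is the dev-tunable
// threshold discussed above, not an existing Ozone config.
class StaleRecoveringContainerCheck {
  static final Duration RECOVERING_TIMEOUT = Duration.ofMinutes(20);

  static boolean isStale(Instant recoveringStartTime, Instant now) {
    // Past the timeout, assume the coordinator is gone or stuck and
    // let a scanner delete the container; a live coordinator that
    // retries will simply re-create it.
    return Duration.between(recoveringStartTime, now)
        .compareTo(RECOVERING_TIMEOUT) > 0;
  }
}
```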

@JacksonYao287 (Contributor) left a comment


LGTM +1!

@umamaheswararao umamaheswararao merged commit d088d19 into apache:master Jul 10, 2022
@umamaheswararao
Contributor Author

Oops, I had a test for this one but forgot to add it in this PR. Anyway, since we got a +1, I moved ahead with the commit and filed a follow-up for adding the test: HDDS-6989.

Thanks a lot @JacksonYao287 for the reviews!

errose28 added a commit to errose28/ozone that referenced this pull request Jul 12, 2022
* master: (46 commits)
  HDDS-6901. Configure HDDS volume reserved as percentage of the volume space. (apache#3532)
  HDDS-6978. EC: Cleanup RECOVERING container on DN restarts (apache#3585)
  HDDS-6982. EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. (apache#3583)
  HDDS-6968. Addendum: [Multi-Tenant] Fix USER_MISMATCH error even on correct user. (apache#3578)
  HDDS-6794. EC: Analyze and add putBlock even on non writing node in the case of partial single stripe. (apache#3514)
  HDDS-6900. Propagate TimeoutException for all SCM HA Ratis calls. (apache#3564)
  HDDS-6938. handle NPE when removing prefixAcl (apache#3568)
  HDDS-6960. EC: Implement the Over-replication Handler (apache#3572)
  HDDS-6979. Remove unused plexus dependency declaration (apache#3579)
  HDDS-6957. EC: ReplicationManager - priortise under replicated containers (apache#3574)
  HDDS-6723. Close Rocks objects properly in OzoneManager (apache#3400)
  HDDS-6942. Ozone Buckets/Objects created via S3 should not allow group access (apache#3553)
  HDDS-6965. Increase timeout for basic check (apache#3563)
  HDDS-6969. Add link to compose directory in smoketest README (apache#3567)
  HDDS-6970. EC: Ensure DatanodeAdminMonitor can handle EC containers during decommission (apache#3573)
  HDDS-6977. EC: Remove references to ContainerReplicaPendingOps in TestECContainerReplicaCount (apache#3575)
  HDDS-6217. Cleanup XceiverClientGrpc TODOs, and document how the client works and should be used. (apache#3012)
  HDDS-6773. Cleanup TestRDBTableStore (apache#3434) - fix checkstyle
  HDDS-6773. Cleanup TestRDBTableStore (apache#3434)
  HDDS-6676. KeyValueContainerData#getProtoBufMessage() should set block count (apache#3371)
  ...

Conflicts:
    hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/SCMUpgradeFinalizer.java
duongkame pushed a commit to duongkame/ozone that referenced this pull request Aug 16, 2022
HDDS-6982. EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. (apache#3583)

(cherry picked from commit d088d19)
Change-Id: I8546c1ad4088f7ddfde3257d78342396146c4d2e