
Conversation

@umamaheswararao
Contributor

What changes were proposed in this pull request?

Recovering containers created by the coordinator will be deleted if reconstruction fails.
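For readers skimming the change, here is a minimal sketch of its shape; every name below is an illustrative assumption, not the actual Ozone class or method:

```java
import java.io.IOException;

// Illustrative sketch only: hypothetical names, not the real
// Ozone reconstruction coordinator.
public class RecoveringContainerCleanupSketch {

  public void reconstructECContainerGroup(long containerId)
      throws IOException {
    try {
      // Create RECOVERING containers on the target datanodes and
      // stream the reconstructed chunks into them.
      reconstruct(containerId);
    } catch (Exception e) {
      // Best-effort cleanup before surfacing the failure; any target
      // we cannot reach keeps its stale RECOVERING container (that
      // gap is covered by HDDS-6978 and the discussion below).
      tryDeleteRecoveringContainers(containerId);
      throw new IOException(
          "EC reconstruction failed for container " + containerId, e);
    }
  }

  private void reconstruct(long containerId) throws IOException {
    // Placeholder for the actual reconstruction work.
  }

  private void tryDeleteRecoveringContainers(long containerId) {
    // Placeholder: send a delete-container command to each target DN,
    // swallowing per-target errors so cleanup stays best effort.
  }
}
```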

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6982

How was this patch tested?

Will be adding tests.

@umamaheswararao umamaheswararao changed the title HDDS-6982: EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. HDDS-6982. EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. Jul 7, 2022
@JacksonYao287 (Contributor) left a comment


Thanks @umamaheswararao for the patch, the changes overall look good.
So now we have two ways to clean up stale recovering containers:
1. If the coordinator encounters any exception while reconstructing the EC container group, it will try to delete all the recovering containers. That is the work of this patch.
2. When a datanode starts up, it will scan all the containers and delete the recovering ones. This will be done in HDDS-6978 (see the sketch below).
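(A hypothetical sketch of mechanism 2, HDDS-6978, with made-up types rather than the real datanode code: a restart means no reconstruction can still be in flight, so any container found in RECOVERING state is safe to drop.)

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical types only, not the real datanode container set.
class StartupRecoveringScanSketch {
  enum ContainerState { OPEN, CLOSED, RECOVERING }

  static final class ContainerInfo {
    final long id;
    final ContainerState state;

    ContainerInfo(long id, ContainerState state) {
      this.id = id;
      this.state = state;
    }
  }

  static void cleanupOnStartup(List<ContainerInfo> containers) {
    Iterator<ContainerInfo> it = containers.iterator();
    while (it.hasNext()) {
      if (it.next().state == ContainerState.RECOVERING) {
        // A restart interrupted reconstruction; the coordinator will
        // re-create the container if it retries, so the partial data
        // can be deleted here.
        it.remove();
      }
    }
  }
}
```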

So, a question: what if the coordinator crashes while trying to delete the recovering containers on the other target datanodes, and those target datanodes never restart? Then there will be no trigger to complete this task.

Should we have some other approach to make sure those recovering containers are ultimately deleted? Maybe we can assign this work to the containerDataScanner. But then another question comes: for a given recovering container, how does the datanode itself know whether the container is still being written by a coordinator, or is just a stale recovering container with a lost coordinator?

@umamaheswararao
Contributor Author

Hi @JacksonYao287, thank you for the comments.
Yeah, we have discussed this point. I have just filed a JIRA for it: https://issues.apache.org/jira/browse/HDDS-6987 (please see the description in that JIRA).

I think we can go with some longer timeout, say 20 minutes? If reconstruction does not finish within 20 minutes, I would consider that the reconstruction task is hitting some really slow path or has failed intermittently. Let's keep that as a dev-tunable param, and as we learn the time actually needed, we can update it accordingly. But let's not expose it to users, to avoid too many internal configs.
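To make the idea concrete, a minimal sketch of such a timeout check; the helper name is hypothetical and the 20-minute default is just the figure floated above, kept as a dev-tunable rather than a user-facing config (the real work is tracked in HDDS-6987):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper; RECOVERING_TIMEOUT is the dev-tunable
// threshold discussed above, not an existing Ozone config.
class StaleRecoveringContainerCheck {
  static final Duration RECOVERING_TIMEOUT = Duration.ofMinutes(20);

  static boolean isStale(Instant recoveringStartTime, Instant now) {
    // Past the timeout, assume the coordinator is gone or stuck and
    // let a scanner delete the container; a live coordinator that
    // retries will simply re-create it.
    return Duration.between(recoveringStartTime, now)
        .compareTo(RECOVERING_TIMEOUT) > 0;
  }
}
```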

@JacksonYao287 (Contributor) left a comment


LGTM +1!

@umamaheswararao umamaheswararao merged commit d088d19 into apache:master Jul 10, 2022
@umamaheswararao
Contributor Author

Oops, I had a test for this one but forgot to add it in this PR. Anyway, since we got a +1, I moved ahead with the commit and filed a follow-up for adding the test: HDDS-6989.

Thanks a lot @JacksonYao287 for the reviews!

errose28 added a commit to errose28/ozone that referenced this pull request Jul 12, 2022
* master: (46 commits)
  HDDS-6901. Configure HDDS volume reserved as percentage of the volume space. (apache#3532)
  HDDS-6978. EC: Cleanup RECOVERING container on DN restarts (apache#3585)
  HDDS-6982. EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. (apache#3583)
  HDDS-6968. Addendum: [Multi-Tenant] Fix USER_MISMATCH error even on correct user. (apache#3578)
  HDDS-6794. EC: Analyze and add putBlock even on non writing node in the case of partial single stripe. (apache#3514)
  HDDS-6900. Propagate TimeoutException for all SCM HA Ratis calls. (apache#3564)
  HDDS-6938. handle NPE when removing prefixAcl (apache#3568)
  HDDS-6960. EC: Implement the Over-replication Handler (apache#3572)
  HDDS-6979. Remove unused plexus dependency declaration (apache#3579)
  HDDS-6957. EC: ReplicationManager - priortise under replicated containers (apache#3574)
  HDDS-6723. Close Rocks objects properly in OzoneManager (apache#3400)
  HDDS-6942. Ozone Buckets/Objects created via S3 should not allow group access (apache#3553)
  HDDS-6965. Increase timeout for basic check (apache#3563)
  HDDS-6969. Add link to compose directory in smoketest README (apache#3567)
  HDDS-6970. EC: Ensure DatanodeAdminMonitor can handle EC containers during decommission (apache#3573)
  HDDS-6977. EC: Remove references to ContainerReplicaPendingOps in TestECContainerReplicaCount (apache#3575)
  HDDS-6217. Cleanup XceiverClientGrpc TODOs, and document how the client works and should be used. (apache#3012)
  HDDS-6773. Cleanup TestRDBTableStore (apache#3434) - fix checkstyle
  HDDS-6773. Cleanup TestRDBTableStore (apache#3434)
  HDDS-6676. KeyValueContainerData#getProtoBufMessage() should set block count (apache#3371)
  ...

Conflicts:
    hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/SCMUpgradeFinalizer.java
duongkame pushed a commit to duongkame/ozone that referenced this pull request Aug 16, 2022
HDDS-6982. EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. (apache#3583)

(cherry picked from commit d088d19)
Change-Id: I8546c1ad4088f7ddfde3257d78342396146c4d2e