HDDS-6806. EC: Implement the EC Reconstruction coordinator. #3504

umamaheswararao · 2022-06-10T14:23:24Z

What changes were proposed in this pull request?

This patch implements the EC Reconstruction coordinator functionality. With this patch, a datanode (DN) can reconstruct the given missing indexes onto the given targets.
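As a rough illustration of what "reconstructing a missing index" means, here is a sketch that rebuilds one lost cell of a stripe from the surviving cells using single-parity XOR. This is a simplified stand-in, not the coordinator's actual decoder: Ozone's EC also supports Reed-Solomon codecs, which can recover more than one missing index. The class and method names are hypothetical, and the sketch assumes all cells are zero-padded to equal length.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical illustration only: rebuilds one missing cell of an EC stripe
 * with single-parity XOR, standing in for the real erasure decoder.
 * Assumes every cell has been zero-padded to the same length.
 */
public final class XorReconstruction {

  private XorReconstruction() { }

  /** XORs all surviving cells together to recover the single missing cell. */
  public static byte[] reconstructMissing(List<byte[]> survivingCells) {
    if (survivingCells.isEmpty()) {
      throw new IllegalArgumentException("need at least one surviving cell");
    }
    int len = survivingCells.get(0).length;
    byte[] out = new byte[len];
    for (byte[] cell : survivingCells) {
      if (cell.length != len) {
        throw new IllegalArgumentException("cells must be padded to equal length");
      }
      for (int i = 0; i < len; i++) {
        out[i] ^= cell[i];
      }
    }
    return out;
  }

  /** Parity of the data cells is the same XOR over all of them. */
  public static byte[] parity(List<byte[]> dataCells) {
    return reconstructMissing(dataCells);
  }

  public static void main(String[] args) {
    byte[] d1 = {1, 2, 3};
    byte[] d2 = {4, 5, 6};
    byte[] p = parity(List.of(d1, d2));
    // Pretend the cell at index 2 (d2) is lost; rebuild it from d1 and parity.
    byte[] rebuilt = reconstructMissing(List.of(d1, p));
    System.out.println(Arrays.equals(rebuilt, d2)); // prints true
  }
}
```

With more than one missing index, XOR is no longer sufficient, which is why a Reed-Solomon style decoder is needed in general.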

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6806

How was this patch tested?

Added tests.


guihecheng · 2022-06-13T13:05:54Z

Hi Uma, having read through the main flow, I think the implementation is good. I'll go through the details later. Thanks~

...ain/java/org/apache/hadoop/ozone/container/ec/reconstruction/ECContainerOperationClient.java

guihecheng · 2022-06-14T13:03:31Z

Hi Uma, I have three general questions:

1. Are we going to treat local-target recovery the same as remote-target recovery?
2. Do we plan to clean up on failures (e.g. if writeChunk fails, we may want to delete all recovered chunk files) in this patch or in a follow-up?
3. Do we plan to run listBlock/CreateRecoveringContainer/CloseContainer concurrently in this patch or in a follow-up?
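The third question names the per-container steps of the flow. A minimal sketch of running them sequentially follows: create a recovering container on every target, write each reconstructed block, then close the containers. TargetClient, Decoder and their methods are hypothetical names chosen for illustration, not the Ozone datanode client API; the block IDs would presumably come from a listBlock call against a healthy replica, so the sketch simply takes them as input.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a sequential reconstruction flow (not Ozone's API). */
public final class SequentialRecoverySketch {

  /** Hypothetical client for one target DN; local and remote look the same. */
  interface TargetClient {
    void createRecoveringContainer(long containerId, int replicaIndex);
    void writeBlock(long containerId, long localBlockId, byte[] data);
    void closeContainer(long containerId);
  }

  /** Hypothetical decoder that rebuilds the cell for a block and replica index. */
  interface Decoder {
    byte[] decode(long localBlockId, int replicaIndex);
  }

  /** Runs create -> write every block -> close, one step at a time. */
  static void reconstruct(long containerId, List<Long> blockIds,
      Map<Integer, TargetClient> targetsByIndex, Decoder decoder) {
    for (Map.Entry<Integer, TargetClient> e : targetsByIndex.entrySet()) {
      e.getValue().createRecoveringContainer(containerId, e.getKey());
    }
    for (long blockId : blockIds) {
      for (Map.Entry<Integer, TargetClient> e : targetsByIndex.entrySet()) {
        byte[] cell = decoder.decode(blockId, e.getKey());
        e.getValue().writeBlock(containerId, blockId, cell);
      }
    }
    for (TargetClient client : targetsByIndex.values()) {
      client.closeContainer(containerId);
    }
  }

  /** Test double that records the calls it receives, in order. */
  static final class RecordingClient implements TargetClient {
    final List<String> calls = new ArrayList<>();

    @Override
    public void createRecoveringContainer(long id, int idx) {
      calls.add("create " + id + " idx=" + idx);
    }

    @Override
    public void writeBlock(long id, long blockId, byte[] data) {
      calls.add("write " + id + " block=" + blockId + " len=" + data.length);
    }

    @Override
    public void closeContainer(long id) {
      calls.add("close " + id);
    }
  }

  public static void main(String[] args) {
    RecordingClient target = new RecordingClient();
    Map<Integer, TargetClient> targets = new LinkedHashMap<>();
    targets.put(3, target); // replica index 3 is missing and is rebuilt on this DN
    reconstruct(1L, List.of(10L, 11L), targets, (blk, idx) -> new byte[4]);
    System.out.println(target.calls);
    // prints [create 1 idx=3, write 1 block=10 len=4, write 1 block=11 len=4, close 1]
  }
}
```

Keeping every step sequential makes the flow easy to reason about; failure cleanup (question 2) is deliberately left out of the sketch.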

umamaheswararao · 2022-06-14T13:48:24Z

Hi @guihecheng, thanks a lot for the review and questions.
1. Yes, for now we treat local and remote targets the same. We can do local optimizations later.
2. I think we can do that separately. Cleanup is not always guaranteed, so we need some kind of leasing mechanism to avoid collisions with a new attempt. We can implement DN self-cleaning after a long timeout.
3. Yes, we will do that. In the initial cut, for stability, we just do it sequentially. When we parallelize, we need to consider other factors such as CPU usage; maybe we should use a small thread pool. But I expect we will investigate that in follow-up patches.
The main goal here is to get an initial working version.
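Point 3 above suggests that a later patch could reconstruct block groups in parallel using a small thread pool. A minimal sketch of that idea, assuming a hypothetical per-block-group task interface (BlockGroupTask is illustrative, not an Ozone class). With poolSize = 1 it degenerates to the sequential behaviour of the initial cut.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: bounded-parallel reconstruction of block groups. */
public final class BoundedReconstruction {

  /** Hypothetical per-block-group work; returns the number of bytes written. */
  interface BlockGroupTask {
    long reconstruct(long localBlockId) throws Exception;
  }

  /** Runs at most poolSize block-group reconstructions at the same time. */
  static long reconstructAll(List<Long> blockIds, BlockGroupTask task, int poolSize)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    try {
      List<Future<Long>> futures = new ArrayList<>();
      for (long id : blockIds) {
        Future<Long> f = pool.submit(() -> task.reconstruct(id));
        futures.add(f);
      }
      long total = 0;
      for (Future<Long> f : futures) {
        total += f.get(); // a task failure surfaces here as ExecutionException
      }
      return total;
    } finally {
      pool.shutdownNow(); // also interrupts remaining tasks if one failed
    }
  }

  public static void main(String[] args) throws Exception {
    // Each fake task "writes" 1 MiB per block group, two at a time.
    long bytes = reconstructAll(List.of(1L, 2L, 3L, 4L), id -> 1L << 20, 2);
    System.out.println(bytes); // prints 4194304
  }
}
```

Bounding the pool size caps how many block groups compete for CPU at once, which is the concern raised in point 3.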

guihecheng · 2022-06-15T01:50:16Z


Thanks for the replies; I agree with these points.

guihecheng · 2022-06-15T08:33:26Z

LGTM+1

adoroszlai

Thanks @umamaheswararao and @guihecheng for the patch. I have a few comments, mostly nits.

hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientManager.java

...in/java/org/apache/hadoop/ozone/container/ec/reconstruction/ECReconstructionCoordinator.java

hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java

...in/java/org/apache/hadoop/ozone/container/ec/reconstruction/ECReconstructionCoordinator.java

...tegration-test/src/test/java/org/apache/hadoop/hdds/scm/storage/TestContainerCommandsEC.java

...in/java/org/apache/hadoop/ozone/container/ec/reconstruction/ECReconstructionCoordinator.java

adoroszlai · 2022-06-16T14:15:45Z

Thanks @umamaheswararao for updating the patch. There are two minor items left, but otherwise LGTM.

umamaheswararao · 2022-06-16T14:47:43Z

Thanks @adoroszlai for taking a look. I have just corrected them. Thanks

adoroszlai · 2022-06-16T16:16:56Z

I forgot to add @guihecheng as co-author in the commit message, sorry about that.

umamaheswararao · 2022-06-16T16:53:19Z

We have just added:

Co-authored-by: Gui Hecheng <[email protected]>

Thanks a lot @guihecheng and @adoroszlai for the reviews!
It's a good step forward for OfflineRecovery. Thanks for all your help.


umamaheswararao added 4 commits June 12, 2022 22:25

HDDS-6806: EC: Implement the EC Reconstruction coordinator

3bc2612

Adding the ECCoordinator implementation class.

c6eafa9

Fixed a warning.

edb0dde

Got few changes from apache#3467 and added few tests to cover padding block indexes reconstruction

2c85a35

umamaheswararao force-pushed the HDDS-6806 branch from 615f546 to 2c85a35 June 13, 2022 05:26

umamaheswararao added 4 commits June 13, 2022 09:51

To make SCCleintConfig inited. TODO: revert this commit

cece6b2

Added XceiverClientManager config

99e3b1f

Some additional cleanups to simplify the logic.

beeb451

Passing the certificate client to XceiverClientManager

4407fcc

guihecheng reviewed Jun 14, 2022


...ain/java/org/apache/hadoop/ozone/container/ec/reconstruction/ECContainerOperationClient.java

Added replica index in createRecoveringContainer

253de95

guihecheng approved these changes Jun 15, 2022


adoroszlai requested changes Jun 15, 2022


Fixed the review comments

eecddbb

adoroszlai reviewed Jun 16, 2022


...in/java/org/apache/hadoop/ozone/container/ec/reconstruction/ECReconstructionCoordinator.java

Fixed few left overs

e2ea67b

adoroszlai approved these changes Jun 16, 2022


adoroszlai merged this pull request into apache:master Jun 16, 2022

umamaheswararao added a commit that referenced this pull request Jun 16, 2022

HDDS-6806. EC: Implement the EC Reconstruction coordinator. (#3504)

f57a019

Co-authored-by: Gui Hecheng <[email protected]>

guihecheng pushed a commit to guihecheng/ozone that referenced this pull request Jun 28, 2022

Revert "HDDS-6806. EC: Implement the EC Reconstruction coordinator. (apache#3504)"

a917337

This reverts commit f57a019.
