HDDS-6890. EC: Fix potential wrong replica read with over-replicated container. #3523

guihecheng · 2022-06-16T09:06:04Z

Co-authored-by: Uma Maheswara Rao G [email protected]

What changes were proposed in this pull request?

Fix potential wrong replica read with over-replicated container.
The problem is due to that an EC pipeline with multiple DNs is returned and used with the first DN picked while refreshing pipeline on read error(Read error is due to concurrent redundant EC container replica deletion, e.g. EC 10+4 with 18 replicas),
but the right behavior should be pick the DN with the same replicaIndex and build a standalone pipeline for retry.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6890

How was this patch tested?

To be verified manually.

…container. Co-authored-by: Uma Maheswara Rao G <[email protected]>

guihecheng · 2022-06-16T09:06:24Z

cc @umamaheswararao @sodonnel for review.

sodonnel

I think this change looks good. I think it will fix the issue.

A test to reproduce the problem is likely too difficult, but I'd like to see a test that executes the new refresh wrapper.

Do you think we could create a protected method called something like:

Function<BlockID, Pipeline> ecPipelineRefreshFunction(BlockId blockID, int replicaIndex, Function<BlockID, Pipeline> refreshFunction) {
...
}

Then we can create a simple unit test that creates a simple refresh function that returns an hardcoded ECPipeline object, that is used by our new wrapper, and validate the correct node is selected based on the index?

guihecheng · 2022-06-16T13:15:48Z

I think this change looks good. I think it will fix the issue.

A test to reproduce the problem is likely too difficult, but I'd like to see a test that executes the new refresh wrapper.

Do you think we could create a protected method called something like:
Function<BlockID, Pipeline> ecPipelineRefreshFunction(BlockId blockID, int replicaIndex, Function<BlockID, Pipeline> refreshFunction) {
...
}
Then we can create a simple unit test that creates a simple refresh function that returns an hardcoded ECPipeline object, that is used by our new wrapper, and validate the correct node is selected based on the index?

Oh, that's a good idea to cover this small function with test, Thanks, I'll update soon.

sodonnel

LGTM. Thanks for adding the unit test.

…container. (apache#3523) Co-authored-by: Uma Maheswara Rao G <[email protected]>

* master: (34 commits) HDDS-6868 Add S3Auth information to thread local (apache#3527) HDDS-6877. Keep replication port unchanged when restarting datanode in MiniOzoneCluster (apache#3510) HDDS-6907. OFS should create buckets with FILE_SYSTEM_OPTIMIZED layout. (apache#3528) HDDS-6875. Migrate parameterized tests in hdds-common to JUnit5 (apache#3513) HDDS-6924. OBJECT_STORE isn't flat namespaced (apache#3533) HDDS-6899. [EC] Remove warnings and errors from console during online reconstruction of data. (apache#3522) HDDS-6695. Enable SCM Ratis by default for new clusters only (apache#3499) HDDS-4123. Integrate OM Open Key Cleanup Service Into Existing Code (apache#3319) HDDS-6882. Correct exit code for invalid arguments passed to command-line tools. (apache#3517) HDDS-6890. EC: Fix potential wrong replica read with over-replicated container. (apache#3523) HDDS-6902. Duplicate mockito-core entries in pom.xml (apache#3525) HDDS-6752. Migrate tests with rules in hdds-server-scm to JUnit5 (apache#3442) HDDS-6806. EC: Implement the EC Reconstruction coordinator. (apache#3504) HDDS-6829. Limit the no of inflight replication tasks in SCM. (apache#3482) HDDS-6898. [SCM HA finalization] Modify acceptance test configuration to speed up test finalization (apache#3521) HDDS-6577. Configurations to reserve HDDS volume space. (apache#3484) HDDS-6870 Clean up isTenantAdmin to use UGI (apache#3503) HDDS-6872. TestAuthorizationV4QueryParser should pass offline (apache#3506) HDDS-6840. Add MetaData volume information to the SCM and OM - UI (apache#3488) HDDS-6697. EC: ReplicationManager - create class to detect EC container health issues (apache#3512) ...

… with over-replicated container. HDDS-6890. EC: Fix potential wrong replica read with over-replicated container. (apache#3523) Co-authored-by: Uma Maheswara Rao G <[email protected]> (cherry picked from commit 5d6ce84) Change-Id: Ic4eb8cdc6c0a618b8ae06f6816b2f181ef969dd3

HDDS-6890. EC: Fix potential wrong replica read with over-replicated …

d6da0f5

…container. Co-authored-by: Uma Maheswara Rao G <[email protected]>

sodonnel reviewed Jun 16, 2022

View reviewed changes

markgui added 2 commits June 17, 2022 11:16

Extract lambda into a member method and add ut for it.

5db339f

Remove unintended remove of new line.

9e663d1

sodonnel approved these changes Jun 17, 2022

View reviewed changes

umamaheswararao merged commit 5d6ce84 into apache:master Jun 17, 2022

guihecheng pushed a commit to guihecheng/ozone that referenced this pull request Jun 20, 2022

HDDS-6890. EC: Fix potential wrong replica read with over-replicated …

0475306

…container. (apache#3523) Co-authored-by: Uma Maheswara Rao G <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HDDS-6890. EC: Fix potential wrong replica read with over-replicated container. #3523

HDDS-6890. EC: Fix potential wrong replica read with over-replicated container. #3523

Uh oh!

guihecheng commented Jun 16, 2022

Uh oh!

guihecheng commented Jun 16, 2022

Uh oh!

sodonnel left a comment •

edited

Loading

Uh oh!

guihecheng commented Jun 16, 2022

Uh oh!

sodonnel left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

HDDS-6890. EC: Fix potential wrong replica read with over-replicated container. #3523

HDDS-6890. EC: Fix potential wrong replica read with over-replicated container. #3523

Uh oh!

Conversation

guihecheng commented Jun 16, 2022

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

guihecheng commented Jun 16, 2022

Uh oh!

sodonnel left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

guihecheng commented Jun 16, 2022

Uh oh!

sodonnel left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sodonnel left a comment •

edited

Loading