HDDS-10717. nodeFailureTimeoutMs should be initialized before syncTimeoutRetry #6560

ivandika3 · 2024-04-19T13:53:51Z

What changes were proposed in this pull request?

We found that the Ratis WriteLog retry count is "0/0", which means WriteLog is not retried at all and the datanode triggers a pipeline failure to close the pipeline. During periods of high IO, this can cause datanodes to send a large number of pipeline close events. Our cluster hit this issue, which caused pipeline thrashing (pipelines were continuously closed and recreated).

The cause is that nodeFailureTimeoutMs was initialized AFTER newRaftProperties and setStateMachineDataConfigurations, so the retry count was calculated incorrectly. HDDS-9821 removed the overwritten configuration and, in doing so, exposed the bug.

This patch fixes the ordering so that syncTimeoutRetry is calculated correctly (default: 30 retries).
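The ordering problem can be illustrated with a minimal, self-contained sketch. It is not the actual Ozone code: the class `RetryOrderingSketch`, its constructor flag, and the timeout values (300,000 ms and 10,000 ms, picked so the result matches the default of 30 mentioned above) are assumptions made for illustration. It also assumes the retry count is derived by dividing the node failure timeout by the sync timeout.

```java
// Hypothetical sketch; names and values do not mirror the actual Ozone classes.
final class RetryOrderingSketch {
  // An unassigned long field holds Java's default value, 0.
  private long nodeFailureTimeoutMs;
  private final int syncTimeoutRetry;

  RetryOrderingSketch(long failureTimeoutMs, long syncTimeoutMs,
      boolean initTimeoutFirst) {
    if (initTimeoutFirst) {
      // Fixed ordering: assign the timeout, then derive the retry count.
      nodeFailureTimeoutMs = failureTimeoutMs;
      syncTimeoutRetry = computeRetry(syncTimeoutMs);
    } else {
      // Buggy ordering: the retry count is derived while the field is still 0.
      syncTimeoutRetry = computeRetry(syncTimeoutMs);
      nodeFailureTimeoutMs = failureTimeoutMs;
    }
  }

  // Assumed derivation: retries = node failure timeout / sync timeout.
  private int computeRetry(long syncTimeoutMs) {
    return (int) (nodeFailureTimeoutMs / syncTimeoutMs);
  }

  int getSyncTimeoutRetry() {
    return syncTimeoutRetry;
  }

  public static void main(String[] args) {
    // Buggy ordering prints 0; fixed ordering prints 30.
    System.out.println(
        new RetryOrderingSketch(300_000L, 10_000L, false).getSyncTimeoutRetry());
    System.out.println(
        new RetryOrderingSketch(300_000L, 10_000L, true).getSyncTimeoutRetry());
  }
}
```

With the buggy ordering the field is read before it is assigned, so the derived retry count is 0 regardless of the configured timeout. Assigning the field first yields the expected value.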

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10717

How was this patch tested?

Clean CI: https://github.com/ivandika3/ozone/actions/runs/8752866026


szetszwo

+1 the change looks good.

adoroszlai · 2024-04-19T15:58:07Z

Thanks @ivandika3 for the patch, @szetszwo for the review.

ivandika3 · 2024-04-20T02:15:04Z

Thank you @szetszwo for the review and @adoroszlai for the merge.


HDDS-10717. nodeFailureTimeoutMs should be initialized before syncTim…

f6faa2e

…eoutRetry

ivandika3 self-assigned this Apr 19, 2024

ivandika3 requested review from adoroszlai and szetszwo April 19, 2024 13:57

szetszwo approved these changes Apr 19, 2024

View reviewed changes

adoroszlai merged commit 5dbd3cf into apache:master Apr 19, 2024

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Apr 23, 2024

HDDS-10717. nodeFailureTimeoutMs should be initialized before syncTim…

c6f9084

…eoutRetry (apache#6560) (cherry picked from commit 5dbd3cf)

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 29, 2024

HDDS-10717. nodeFailureTimeoutMs should be initialized before syncTim…

935bb22

…eoutRetry (apache#6560) (cherry picked from commit 5dbd3cf)
