@ivandika3
Contributor

@ivandika3 ivandika3 commented Apr 19, 2024

What changes were proposed in this pull request?

The Ratis WriteLog retry count was found to be "0/0", meaning the WriteLog is not retried at all and the datanode instead triggers a pipeline failure to close the pipeline. During periods of high IO this can cause datanodes to send a large number of pipeline close events. Our cluster hit this issue, which caused pipeline thrashing (pipelines were continuously closed and recreated).

The cause is that nodeFailureTimeoutMs was initialized AFTER newRaftProperties and setStateMachineDataConfigurations were called, so syncTimeoutRetry was calculated before the timeout had been set. HDDS-9821 removed the configuration that had been overwriting this value, which exposed the bug.

This patch fixes the ordering so that syncTimeoutRetry is calculated correctly (default: 30 retries).
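For readers unfamiliar with this class of bug, here is a minimal, self-contained sketch of the ordering problem. It is not the Ozone code: the class `InitOrderSketch`, the method `newProperties`, the property key, and both timeout constants are hypothetical, chosen only so that the correct order yields 30 retries. It assumes the retry count is derived by dividing the node failure timeout by the sync timeout.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of an initialization-order bug; not the actual Ozone classes.
class InitOrderSketch {
  // Assumed values, picked so that the correct ordering gives 30 retries.
  static final long NODE_FAILURE_TIMEOUT_MS = 300_000L;
  static final long SYNC_TIMEOUT_MS = 10_000L;

  static class Server {
    // A long instance field defaults to 0 until it is explicitly assigned.
    private long nodeFailureTimeoutMs;
    final Map<String, Integer> properties;

    Server(boolean fixedOrder) {
      if (fixedOrder) {
        // Fixed order: set the timeout first, then build the properties.
        nodeFailureTimeoutMs = NODE_FAILURE_TIMEOUT_MS;
        properties = newProperties();
      } else {
        // Buggy order: the properties are built while the field is still 0.
        properties = newProperties();
        nodeFailureTimeoutMs = NODE_FAILURE_TIMEOUT_MS;
      }
    }

    // Derives the retry count from the field's value at the time of the call.
    private Map<String, Integer> newProperties() {
      Map<String, Integer> p = new HashMap<>();
      p.put("syncTimeoutRetry", (int) (nodeFailureTimeoutMs / SYNC_TIMEOUT_MS));
      return p;
    }
  }

  // Convenience helper returning the computed retry count for either ordering.
  static int retriesFor(boolean fixedOrder) {
    return new Server(fixedOrder).properties.get("syncTimeoutRetry");
  }

  public static void main(String[] args) {
    System.out.println("buggy order: " + retriesFor(false));
    System.out.println("fixed order: " + retriesFor(true));
  }
}
```

In the buggy branch the division uses the field's default value of 0, so the computed retry count is 0 regardless of the configured timeout. Moving the assignment ahead of the property construction is the whole fix in this sketch.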

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10717

How was this patch tested?

Clean CI: https://github.com/ivandika3/ozone/actions/runs/8752866026

@ivandika3 ivandika3 self-assigned this Apr 19, 2024
Contributor

@szetszwo szetszwo left a comment

+1 the change looks good.

@adoroszlai adoroszlai merged commit 5dbd3cf into apache:master Apr 19, 2024
@adoroszlai
Contributor

Thanks @ivandika3 for the patch, @szetszwo for the review.

@ivandika3
Contributor Author

Thank you @szetszwo for the review and @adoroszlai for the merge.

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Apr 23, 2024
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 29, 2024