-
Notifications
You must be signed in to change notification settings - Fork 587
HDDS-8024. When readChunk from a datanode fails, retry other datanodes. #4336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jojochuang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. This is a good fix and looks good to me!
Prior to this change, BlockInputStream defines retry policy which retries three times with a 1 ms delay before retry.
After this change, reading a block could retry up to 9 times (assuming 3 replicas): r1, r2, r3, r1, r2, r3, r1, r2, r3.
It might be a good idea to extend the delay between retrying r3 and r1, in case of cluster-wide restart. In fact, we should consider testing such cases so that clients don't fail prematurely.
Also for the case where block token expires (https://issues.apache.org/jira/browse/HDDS-7930) trying another replica will not help, although the retry policy at BlockInputStream level will address this, and a client is supposed to only hit the block token expiration rarely.
Just a suggestion: can you update the jira subject so that it is easier for the community to tell this is a behavior change, not just fixing flaky tests?
Other than that I am +1.
|
@jojochuang , thanks a lot for reviewing this! Updated the title. |
|
@szetszwo thanks for the improvement. Sorry for being late, but I tried this fix in repeated CI runs and it didn't seem to fix the problem. 56 out of 300 iterations failed: |
|
@adoroszlai , thanks a lot for testing it! The line numbers do not match the latest code. I guess it was testing the old code? |
No, it was testing the PR, you can find diff here: master...adoroszlai:hadoop-ozone:HDDS-8024-repeat Line numbers in |
@adoroszlai , a01676a is HDDS-8029 which is BEFORE this commit 3bbb574 as shown below. Anyway, could you run the test again with the latest code? |
The lines added by a01676a were not present in your PR, as it was based on a prior state of Now that the PR has been merged and became 3bbb574 on Now I have launched a new run with current |
This one failed, too. Example: |
|
@adoroszlai , thanks for the update. Sorry that I may have misunderstood how the test was run.
The line number in this stack trace matched the current code. Let me take a look. |
|
@adoroszlai found that getBlock also does not retry other datnaodes. Submitted #4357 |
What changes were proposed in this pull request?
When failing read chunk from a datanode, exclude it and retry other datanodes.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8024
How was this patch tested?
Updated TestHSync