-
Notifications
You must be signed in to change notification settings - Fork 590
HDDS-4404. Datanode can go OOM when a Recon or SCM Server is very slow in processing reports #1552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…w in processing reports Change-Id: I94df595074a0b3cff7acdd4e220302d6f7bd49b0
| // put back the reports which failed to be sent | ||
| putBackReports(requestBuilder); | ||
| // don't resend reports to recon as it could be down for days | ||
| // DN is expected to work fine without recon and not go OOM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't put back container actions and pipeline actions for SCM, or we have a legacy reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GlenGeng This change shouldn't affect SCM at all. Since SCM should be active (rpcEndpoint.isPassive() == false).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I believe container actions and pipeline actions are put back on exception. cmiiw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a concise fix. I am fine with it.
The question about container/pipeline Action won't be a blocker of merging this fix.
| if (!rpcEndpoint.isPassive()) { | ||
| // put back the reports which failed to be sent | ||
| putBackReports(requestBuilder); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the case of Recon is down , current change can fix this.
But for SCM is down, it could still lead OOM error I think since we put back reports again and again.
A better way I am thinking for this: if we can check from thrown exception to see if SCM/Recon is out of service, if yes, no need to put back report anymore.
If there is the connection issue of remote server, then leads the IOException. Can we maintain the lastAddedReport info to help do the comparison of current report that will be added in StateContext? Actually, this type report should not be added again if there is no any difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment @linyiqun .
Yes. If SCM is down for a very long time, DN can still OOM due to the same reason as Recon. So this is, at best, a quick fix.
A better way I am thinking for this: if we can check from thrown exception to see if SCM/Recon is out of service, if yes, no need to put back report anymore.
@nandakumar131 mentioned in the jira comment that putBackReports() serves the purpose of informing SCM of block deletion. In this case we don't want to just toss those reports away when SCM is down. Checking the exact type of exception (e.g. if it is a connection issue) is a good idea though.
I agree in the long run the fix could be to only queue the latest and necessary report.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the quick fix for current issue, I'm okay for this change.
|
Closing this PR as the proper fix is being done in #1601 |
What changes were proposed in this pull request?
HeartbeatEndpointTask#callshould not resend reports to passive endpoints (Recon)What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-4404
How was this patch tested?
No new test added. Existing ones should do.