HDDS-4404. Datanode can go OOM when a Recon or SCM Server is very slow in processing reports #1552
@@ -147,8 +147,12 @@ public EndpointStateMachine.EndPointStates call() throws Exception {
         rpcEndpoint.setLastSuccessfulHeartbeat(ZonedDateTime.now());
         rpcEndpoint.zeroMissedCount();
       } catch (IOException ex) {
-        // put back the reports which failed to be sent
-        putBackReports(requestBuilder);
+        // don't resend reports to recon as it could be down for days
+        // DN is expected to work fine without recon and not go OOM
+        if (!rpcEndpoint.isPassive()) {
+          // put back the reports which failed to be sent
+          putBackReports(requestBuilder);
+        }
Contributor
For the case where Recon is down, the current change fixes the problem. However, a connection issue with a remote server also leads to an IOException. Can we keep track of the last added report (lastAddedReport) and compare it with the report that is about to be added in StateContext? A report of this type should not be added again if nothing has changed.
Contributor
Author
Thanks for the comment @linyiqun. Yes. If SCM is down for a very long time, the DN can still go OOM for the same reason as with Recon. So this is, at best, a quick fix.
As @nandakumar131 mentioned in the Jira comment, and I agree, in the long run the fix could be to queue only the latest and necessary report.
Contributor
As a quick fix for the current issue, I'm okay with this change.
         rpcEndpoint.logIfNeeded(ex);
       } finally {
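The longer-term idea raised in the review thread (queue only the latest and necessary report, and skip a report that has not changed) could look roughly like the sketch below. This is only an illustration under stated assumptions: `LatestReportHolder`, its methods, and the use of plain strings for report types and payloads are hypothetical and are not taken from Ozone's `StateContext`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder, NOT Ozone's StateContext API.
// It keeps at most one pending report per report type, so memory use is
// bounded by the number of report types no matter how long a server is down.
final class LatestReportHolder {
  private final Map<String, String> pending = new LinkedHashMap<>();
  private final Map<String, String> lastAdded = new LinkedHashMap<>();

  // Returns true if the report was queued, false if it is identical to the
  // last report added for this type and was therefore skipped.
  synchronized boolean addReport(String type, String payload) {
    if (payload.equals(lastAdded.get(type))) {
      return false;
    }
    lastAdded.put(type, payload);
    pending.put(type, payload); // overwrites any older, unsent report
    return true;
  }

  // Removes and returns every report waiting to be sent.
  synchronized Map<String, String> drain() {
    Map<String, String> out = new LinkedHashMap<>(pending);
    pending.clear();
    return out;
  }

  // Puts back reports that failed to send, unless a newer report of the
  // same type has been queued in the meantime.
  synchronized void putBack(Map<String, String> failed) {
    failed.forEach(pending::putIfAbsent);
  }

  synchronized int size() {
    return pending.size();
  }
}
```

With a structure like this, putting reports back after an IOException cannot grow the queue, because a put-back report either fills an empty slot or is dropped in favor of a newer one.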
Why not put back container actions and pipeline actions for SCM as well? Or is there a legacy reason?

@GlenGeng This change shouldn't affect SCM at all, since SCM should be active (rpcEndpoint.isPassive() == false).

Also, I believe container actions and pipeline actions are put back on exception. cmiiw.
cc @nandakumar131 @lokeshj1703

This is a concise fix. I am fine with it.
The question about container/pipeline actions won't block merging this fix.
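
To make the active versus passive distinction discussed above concrete, here is a minimal, self-contained sketch of the behavior the diff introduces. It is not the actual Ozone heartbeat task: `HeartbeatSketch`, the nested `Endpoint` interface, the `failing` helper, and the string reports are all hypothetical stand-ins; only the `isPassive()` check mirrors the change in this PR.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for the heartbeat task; names are illustrative only.
final class HeartbeatSketch {
  interface Endpoint {
    boolean isPassive();
    void send(String report) throws IOException;
  }

  private final Deque<String> queue = new ArrayDeque<>();

  void enqueue(String report) {
    queue.addLast(report);
  }

  int queued() {
    return queue.size();
  }

  // Sends the oldest queued report. On failure the report is put back only
  // for active endpoints (SCM); for passive endpoints (Recon) it is dropped.
  void heartbeat(Endpoint endpoint) {
    String report = queue.pollFirst();
    if (report == null) {
      return;
    }
    try {
      endpoint.send(report);
    } catch (IOException ex) {
      if (!endpoint.isPassive()) {
        queue.addFirst(report);
      }
    }
  }

  // Test helper: an endpoint whose send() always fails.
  static Endpoint failing(boolean passive) {
    return new Endpoint() {
      @Override
      public boolean isPassive() {
        return passive;
      }

      @Override
      public void send(String report) throws IOException {
        throw new IOException("endpoint unreachable");
      }
    };
  }
}
```

In this sketch a failing active endpoint leaves the report in the queue for the next heartbeat, while a failing passive endpoint leaves the queue empty, which is why the change should not alter what SCM receives.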