HDDS-4404. Datanode can go OOM when a Recon or SCM Server is very slow in processing reports #1552

smengcl · 2020-11-03T21:49:25Z

What changes were proposed in this pull request?

HeartbeatEndpointTask#call should not resend reports to passive endpoints (Recon)

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-4404

How was this patch tested?

No new tests were added. The existing tests should cover this change.

…w in processing reports Change-Id: I94df595074a0b3cff7acdd4e220302d6f7bd49b0

GlenGeng-awx · 2020-11-05T06:32:59Z

...ain/java/org/apache/hadoop/ozone/container/common/states/endpoint/HeartbeatEndpointTask.java

-      // put back the reports which failed to be sent
-      putBackReports(requestBuilder);
+      // don't resend reports to recon as it could be down for days
+      // DN is expected to work fine without recon and not go OOM


Why don't we put back container actions and pipeline actions for SCM? Is there a legacy reason?

smengcl · 2020-11-09

@GlenGeng This change shouldn't affect SCM at all, since SCM is an active endpoint (rpcEndpoint.isPassive() == false).

Also, I believe container actions and pipeline actions are put back on exception. Correct me if I'm wrong.

cc @nandakumar131 @lokeshj1703

GlenGeng-awx · 2020-11-10

This is a concise fix. I am fine with it.
The question about container/pipeline actions is not a blocker for merging this fix.

linyiqun · 2020-11-10T09:25:30Z

...ain/java/org/apache/hadoop/ozone/container/common/states/endpoint/HeartbeatEndpointTask.java

+      if (!rpcEndpoint.isPassive()) {
+        // put back the reports which failed to be sent
+        putBackReports(requestBuilder);
+      }
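
To illustrate the effect of the guard in this hunk, here is a minimal, self-contained sketch. It is not the real datanode code: `PassiveGuardSketch`, `Endpoint`, `heartbeat`, and `pendingAfterOutage` are hypothetical names, and reports are plain strings. Only `isPassive()` and the idea of putting failed reports back come from the diff.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the heartbeat task; not the real Ozone classes.
class PassiveGuardSketch {

  /** Hypothetical endpoint holding only a passive flag and a pending-report queue. */
  static final class Endpoint {
    private final boolean passive;
    private final Deque<String> pendingReports = new ArrayDeque<>();

    Endpoint(boolean passive) {
      this.passive = passive;
    }

    boolean isPassive() {
      return passive;
    }

    void addReport(String report) {
      pendingReports.addLast(report);
    }

    int pendingCount() {
      return pendingReports.size();
    }
  }

  /** Simulates one heartbeat attempt; returns true if the batch was delivered. */
  static boolean heartbeat(Endpoint endpoint, boolean serverReachable) {
    // Drain everything that is currently pending into one batch.
    List<String> batch = new ArrayList<>(endpoint.pendingReports);
    endpoint.pendingReports.clear();
    if (serverReachable) {
      return true;
    }
    if (!endpoint.isPassive()) {
      // Put the failed reports back in their original order (active endpoints only).
      for (int i = batch.size() - 1; i >= 0; i--) {
        endpoint.pendingReports.addFirst(batch.get(i));
      }
    }
    // For a passive endpoint the failed batch is simply dropped.
    return false;
  }

  /** Generates one report per round while the server stays unreachable. */
  static int pendingAfterOutage(boolean passive, int rounds) {
    Endpoint endpoint = new Endpoint(passive);
    for (int i = 0; i < rounds; i++) {
      endpoint.addReport("report-" + i);
      heartbeat(endpoint, false);
    }
    return endpoint.pendingCount();
  }

  public static void main(String[] args) {
    System.out.println("active endpoint backlog:  " + pendingAfterOutage(false, 1000));
    System.out.println("passive endpoint backlog: " + pendingAfterOutage(true, 1000));
  }
}
```

In this simplified model the backlog for an active endpoint grows by one report per failed heartbeat, while the backlog for a passive endpoint stays empty, which is the memory behavior the guard is meant to achieve for Recon.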


This change fixes the case where Recon is down.
However, I think an SCM outage could still lead to an OOM error, since we put the reports back again and again.
A better approach might be to inspect the thrown exception to determine whether SCM/Recon is out of service; if it is, there is no need to put the reports back.

A connection problem with the remote server results in an IOException. Can we keep lastAddedReport info so that a report about to be added to StateContext can be compared against it? A report of this type should not be added again if nothing has changed.
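
The comparison suggested above could look roughly like the following sketch. Everything in it is hypothetical: `DedupingReportQueue` is not the real `StateContext`, reports are modelled as strings keyed by a report type, and `lastAddedReport` is just the name used in the comment.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Queue;

// Hypothetical queue that skips a report identical to the last one added for its type.
class DedupingReportQueue {
  private final Queue<String> pending = new ArrayDeque<>();
  // Last payload added per report type (the "lastAddedReport" idea from the comment).
  private final Map<String, String> lastAddedReport = new HashMap<>();

  /** Returns true if the report was queued, false if it was skipped as unchanged. */
  boolean addReport(String type, String payload) {
    if (Objects.equals(lastAddedReport.get(type), payload)) {
      return false;
    }
    lastAddedReport.put(type, payload);
    pending.add(type + ":" + payload);
    return true;
  }

  int size() {
    return pending.size();
  }

  public static void main(String[] args) {
    DedupingReportQueue queue = new DedupingReportQueue();
    System.out.println(queue.addReport("node", "capacity=10"));
    System.out.println(queue.addReport("node", "capacity=10"));
    System.out.println(queue.addReport("node", "capacity=11"));
    System.out.println("pending: " + queue.size());
  }
}
```

One caveat with this approach, stated as an assumption rather than a finding from the PR: skipping an unchanged report is only safe if the previous copy is still queued or was actually delivered, so a real implementation would need to reset the remembered value whenever a report is dropped.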

smengcl · 2020-11-13

Thanks for the comment, @linyiqun.

Yes. If SCM is down for a very long time, the DN can still go OOM for the same reason as with Recon. So this is, at best, a quick fix.

> A better approach might be to inspect the thrown exception to determine whether SCM/Recon is out of service; if it is, there is no need to put the reports back.

@nandakumar131 mentioned in the JIRA comment that putBackReports() serves the purpose of informing SCM of block deletion. In that case we don't want to simply discard those reports when SCM is down. Checking the exact type of exception (e.g. whether it is a connection issue) is a good idea, though.
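
As a rough illustration of checking the exception type, the sketch below walks the cause chain and looks for standard JDK connection exceptions. `HeartbeatFailureClassifier` and `isConnectionIssue` are hypothetical names, and the exact exception types that the RPC layer surfaces to the datanode are an assumption here. The sketch only classifies the failure; whether reports are then kept or dropped is a separate policy decision.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

// Hypothetical helper; the real RPC stack may wrap failures differently.
class HeartbeatFailureClassifier {

  // Upper bound on how far the cause chain is followed, in case of cyclic causes.
  private static final int MAX_CAUSE_DEPTH = 16;

  /** Returns true if the throwable or one of its causes looks like a connectivity problem. */
  static boolean isConnectionIssue(Throwable failure) {
    int depth = 0;
    for (Throwable cause = failure;
         cause != null && depth < MAX_CAUSE_DEPTH;
         cause = cause.getCause(), depth++) {
      if (cause instanceof ConnectException
          || cause instanceof NoRouteToHostException
          || cause instanceof SocketTimeoutException
          || cause instanceof UnknownHostException) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    IOException wrapped = new IOException("heartbeat failed", new ConnectException("refused"));
    System.out.println(isConnectionIssue(wrapped));
    System.out.println(isConnectionIssue(new IOException("malformed response")));
  }
}
```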

I agree that in the long run the fix could be to queue only the latest necessary reports.
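
A minimal sketch of the "queue only the latest report" idea follows. `LatestReportBuffer` is a hypothetical class, reports are modelled as strings keyed by type, and the sketch assumes that a newer report of a given type fully supersedes the older one, which would not hold for incremental reports.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical buffer that keeps at most one pending report per report type.
class LatestReportBuffer {
  private final Map<String, String> latestByType = new LinkedHashMap<>();

  /** Stores the report, replacing any older pending report of the same type. */
  synchronized void offer(String type, String payload) {
    latestByType.put(type, payload);
  }

  /** Removes and returns all pending reports, one per type. */
  synchronized List<String> drain() {
    List<String> batch = new ArrayList<>(latestByType.values());
    latestByType.clear();
    return batch;
  }

  synchronized int size() {
    return latestByType.size();
  }

  public static void main(String[] args) {
    LatestReportBuffer buffer = new LatestReportBuffer();
    for (int i = 0; i < 1000; i++) {
      buffer.offer("node", "node-report-" + i);
      buffer.offer("container", "container-report-" + i);
    }
    System.out.println("pending reports: " + buffer.size());
  }
}
```

With this structure the pending set is bounded by the number of report types rather than by the length of the outage, so putting a failed batch back cannot grow memory without limit.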

linyiqun · 2020-11-14

As a quick fix for the current issue, I'm okay with this change.

smengcl · 2020-11-19T23:17:10Z

Closing this PR, as the proper fix is being done in #1601.

HDDS-4404. Datanode can go OOM when a Recon or SCM Server is very slo…

7f7d333

…w in processing reports Change-Id: I94df595074a0b3cff7acdd4e220302d6f7bd49b0

jojochuang requested review from avijayanhwx and nandakumar131 November 3, 2020 22:14

GlenGeng-awx reviewed Nov 5, 2020


linyiqun reviewed Nov 10, 2020


smengcl mentioned this pull request Nov 19, 2020

HDDS-4404. Datanode can go OOM when a Recon or SCM Server is very slow in processing reports #1601

Merged

smengcl closed this Nov 19, 2020
