Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -147,8 +147,12 @@ public EndpointStateMachine.EndPointStates call() throws Exception {
rpcEndpoint.setLastSuccessfulHeartbeat(ZonedDateTime.now());
rpcEndpoint.zeroMissedCount();
} catch (IOException ex) {
// put back the reports which failed to be sent
putBackReports(requestBuilder);
// don't resend reports to recon as it could be down for days
// DN is expected to work fine without recon and not go OOM
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't put back container actions and pipeline actions for SCM, or we have a legacy reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GlenGeng This change shouldn't affect SCM at all. Since SCM should be active (rpcEndpoint.isPassive() == false).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I believe container actions and pipeline actions are put back on exception. cmiiw.

cc @nandakumar131 @lokeshj1703

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a concise fix. I am fine with it.
The question about container/pipeline Action won't be a blocker of merging this fix.

if (!rpcEndpoint.isPassive()) {
// put back the reports which failed to be sent
putBackReports(requestBuilder);
}
Copy link
Contributor

@linyiqun linyiqun Nov 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the case of Recon is down , current change can fix this.
But for SCM is down, it could still lead OOM error I think since we put back reports again and again.
A better way I am thinking for this: if we can check from thrown exception to see if SCM/Recon is out of service, if yes, no need to put back report anymore.

If there is the connection issue of remote server, then leads the IOException. Can we maintain the lastAddedReport info to help do the comparison of current report that will be added in StateContext? Actually, this type report should not be added again if there is no any difference.

Copy link
Contributor Author

@smengcl smengcl Nov 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment @linyiqun .

Yes. If SCM is down for a very long time, DN can still OOM due to the same reason as Recon. So this is, at best, a quick fix.

A better way I am thinking for this: if we can check from thrown exception to see if SCM/Recon is out of service, if yes, no need to put back report anymore.

@nandakumar131 mentioned in the jira comment that putBackReports() serves the purpose of informing SCM of block deletion. In this case we don't want to just toss those reports away when SCM is down. Checking the exact type of exception (e.g. if it is a connection issue) is a good idea though.

I agree in the long run the fix could be to only queue the latest and necessary report.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the quick fix for current issue, I'm okay for this change.


rpcEndpoint.logIfNeeded(ex);
} finally {
Expand Down