Skip to content

Conversation

@smengcl
Copy link
Contributor

@smengcl smengcl commented Nov 3, 2020

What changes were proposed in this pull request?

HeartbeatEndpointTask#call should not resend reports to passive endpoints (Recon)

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4404

How was this patch tested?

No new test added. Existing ones should do.

…w in processing reports

Change-Id: I94df595074a0b3cff7acdd4e220302d6f7bd49b0
// put back the reports which failed to be sent
putBackReports(requestBuilder);
// don't resend reports to recon as it could be down for days
// DN is expected to work fine without recon and not go OOM
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't put back container actions and pipeline actions for SCM, or we have a legacy reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GlenGeng This change shouldn't affect SCM at all. Since SCM should be active (rpcEndpoint.isPassive() == false).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I believe container actions and pipeline actions are put back on exception. cmiiw.

cc @nandakumar131 @lokeshj1703

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a concise fix. I am fine with it.
The question about container/pipeline Action won't be a blocker of merging this fix.

if (!rpcEndpoint.isPassive()) {
// put back the reports which failed to be sent
putBackReports(requestBuilder);
}
Copy link
Contributor

@linyiqun linyiqun Nov 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the case of Recon is down , current change can fix this.
But for SCM is down, it could still lead OOM error I think since we put back reports again and again.
A better way I am thinking for this: if we can check from thrown exception to see if SCM/Recon is out of service, if yes, no need to put back report anymore.

If there is the connection issue of remote server, then leads the IOException. Can we maintain the lastAddedReport info to help do the comparison of current report that will be added in StateContext? Actually, this type report should not be added again if there is no any difference.

Copy link
Contributor Author

@smengcl smengcl Nov 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment @linyiqun .

Yes. If SCM is down for a very long time, DN can still OOM due to the same reason as Recon. So this is, at best, a quick fix.

A better way I am thinking for this: if we can check from thrown exception to see if SCM/Recon is out of service, if yes, no need to put back report anymore.

@nandakumar131 mentioned in the jira comment that putBackReports() serves the purpose of informing SCM of block deletion. In this case we don't want to just toss those reports away when SCM is down. Checking the exact type of exception (e.g. if it is a connection issue) is a good idea though.

I agree in the long run the fix could be to only queue the latest and necessary report.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the quick fix for current issue, I'm okay for this change.

@smengcl
Copy link
Contributor Author

smengcl commented Nov 19, 2020

Closing this PR as the proper fix is being done in #1601

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants