Skip to content

Conversation

@smengcl
Copy link
Contributor

@smengcl smengcl commented Nov 19, 2020

This is an improved version of the previous fix (#1552) for HDDS-4404.

The approach is relatively straightforward: DN will only save the latest containerReport/nodeReport/pipelineReport in StateContext (instead of the previous behavior where DN keeps putting all reports back on communication error.)

Testing

  • Existing tests in TestHeartbeatEndpointTask passed locally.
  • New UTs added.
  • Internal scale test w/ 1M containers passed: recon unresponsive for 48 hours, DN heap usage kept at the same level -- w/o patch the DNs OOM'ed after 24 hours (All 15 DNs max heap set to 32GB, reports queue kept growing).

…w in processing reports.

Change-Id: Idf6394d0bde2fb1c9fe99a96c49fa28677e0e527
@smengcl smengcl self-assigned this Nov 19, 2020
Copy link
Contributor

@avijayanhwx avijayanhwx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smengcl Thanks for working on this. Approach looks good to me.

Can we add a unit test to make sure the basic works? (addReport updates the report, and subsequent getReports should get the newly added report)

Change-Id: I6c4729634905bf6bba5bb6483c393d3c7511a78e
@smengcl
Copy link
Contributor Author

smengcl commented Nov 19, 2020

UT TestStateContext#testReportAPIs is not happy with my change in StateContext#addReport:

Mocked object GeneratedMessage report is fed to addReport(), so report.getDescriptorForType() returns null. As a result report.getDescriptorForType().getFullName() throws NPE in the UT.

Change-Id: I63470510ed2089438f08f4d3cea321c1b5f01c3c
Change-Id: Id3ff6222ed279568d32876afc0a6a5ee1316c926
Change-Id: I25020b4411f4163a2ec134431793bfe663e5f6bf
Change-Id: I2dc93c42fc7468b37d41a179ca047ee385fc8896
final String reportType = report.getDescriptorForType().getFullName();
for (InetSocketAddress endpoint : endpoints) {
// We only keep the latest container, node and pipeline report
if (reportType.equals(CONTAINER_REPORTS_PROTO_NAME)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to synchronize the get/set on these containerReport/nodeReport/pipelineReport ?

Copy link
Contributor Author

@smengcl smengcl Nov 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, container/node/pipeline should be fine. We need synchroized on reports because we want Map get and add not to get messed in between.

Copy link
Contributor

@runzhiwang runzhiwang Nov 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one thread write containerReport and another thread read containerReport, it's not thread safe, we can use AtomicReference<GeneratedMessage> containerReports

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

…d by adding mock-maker-inline.

Change-Id: I0f21c508af1f7b36d7632aaa3b9173a17056f211
Change-Id: I8d67a1b6e2d9043fb1479e2547d2e690932f7985
Change-Id: Iaec544ba6fb211d0d71fe03b26b0fc70199dc882
@runzhiwang
Copy link
Contributor

runzhiwang commented Nov 23, 2020

@smengcl Thanks the patch. It looks good to me overall. But I think we need more strong integration test to cover the case: StateContext#reports will not contain any ContainerReport, NodeReport, and DatanodeReport.

…e only used in UT, at least at the moment.

Change-Id: I72070fde8df220ecebf2fb6d1015d19f65cee11e
@smengcl
Copy link
Contributor Author

smengcl commented Nov 23, 2020

@smengcl Thanks the patch. It looks good to me overall. But I think we need more strong integration test to cover the case: StateContext#reports will not contain any ContainerReport, NodeReport, and DatanodeReport.

Thanks for the review @runzhiwang . In a previous revision I had a strict check in StateContext#addReport:
22846f4#diff-4e474a0cbb28091ed4fab8ee80878c3acf7e15711fba2a0d83bacd56ea8f5707L237-L240

which ensures only CommandStatusReports and IncrementalContainerReport would ever be added to reports queue. Later this is removed since @avijayanhwx suggested we don't really need that.

The only two paths where reports would be queued are addReports and StateContext#putBackReports, where HeartbeatEndpointTask#putBackReports makes sure we only put back CommandStatusReports and IncrementalContainerReport.

Would you elaborate on the integration test plan you have on your mind?

@runzhiwang
Copy link
Contributor

Thanks for the review @runzhiwang . In a previous revision I had a strict check in StateContext#addReport:
22846f4#diff-4e474a0cbb28091ed4fab8ee80878c3acf7e15711fba2a0d83bacd56ea8f5707L237-L240

which ensures only CommandStatusReports and IncrementalContainerReport would ever be added to reports queue. Later this is removed since @avijayanhwx suggested we don't really need that.

The only two paths where reports would be queued are addReports and StateContext#putBackReports, where HeartbeatEndpointTask#putBackReports makes sure we only put back CommandStatusReports and IncrementalContainerReport.

Would you elaborate on the integration test plan you have on your mind?

@smengcl OOM is a serious problem, so we need strong test. If somebody change the existed two path, or add the third path, and put ContainerReport in the queue, OOM will happen, so we need some restrain. The integration test maybe only can cover the existed two path, for example, you can decrease the time interval of ReportPublisher#call and increase the time interval of EndpointStateMachine.EndPointStates call(), so that ContainerReport generated many times but has not been send out, then we check the reports do not contain ContainerReport.

Copy link
Contributor Author

@smengcl smengcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @runzhiwang . In a previous revision I had a strict check in StateContext#addReport:
22846f4#diff-4e474a0cbb28091ed4fab8ee80878c3acf7e15711fba2a0d83bacd56ea8f5707L237-L240
which ensures only CommandStatusReports and IncrementalContainerReport would ever be added to reports queue. Later this is removed since @avijayanhwx suggested we don't really need that.
The only two paths where reports would be queued are addReports and StateContext#putBackReports, where HeartbeatEndpointTask#putBackReports makes sure we only put back CommandStatusReports and IncrementalContainerReport.
Would you elaborate on the integration test plan you have on your mind?

@smengcl OOM is a serious problem, so we need strong test. If somebody change the existed two path, or add the third path, and put ContainerReport in the queue, OOM will happen, so we need some restrain. The integration test maybe only can cover the existed two path, for example, you can decrease the time interval of ReportPublisher#call and increase the time interval of EndpointStateMachine.EndPointStates call(), so that ContainerReport generated many times but has not been send out, then we check the reports do not contain ContainerReport.

Thanks @runzhiwang .

I think a better and easier way to check this behavior is to add a new UT that calls addReports and putBackReports repetitively, then check the reports queue length. Should only expect the queue grow when report type is CommandStatusReport or IncrementalContainerReport, but not any other ones. What do you think?

…e functionality of the map.

- Added `acceptedIncrementalReportTypeSet` for rigorous report type checking.
- `StateContext#addReport` now explicitly throws, instead of hiding any possible NPEs. Rationale: As I think again, those reports are all internally generated rather than dependent on direct user input. It's better to catch any problem early (if any).
- `StateContext#addReport` also strictly checks for report type now. This intends to force any future additions of report types to be explicitly allowed. Only 5 report types (names) are expected:
```
hadoop.hdds.NodeReportProto
hadoop.hdds.ContainerReportsProto
hadoop.hdds.IncrementalContainerReportProto
hadoop.hdds.CommandStatusReportsProto
hadoop.hdds.PipelineReportsProto
```
- `StateContext#putBackReports` now checks ALL reports in the list to be put back. I'd expect the list to be small. And since the logic is run on each DataNode, it shouldn't become a performance bottleneck. Also added debug logging just in case we need to diagnose it in the future.
- Added new UT `testPutBackReports` and `testReportQueueWithAddReports` to check that `putBackReports` and `getReports` behaves.

Change-Id: I48aca3d4a422d52532b39c9462a30519c9cbe0eb
@smengcl
Copy link
Contributor Author

smengcl commented Nov 24, 2020

Summary of major changes to the latest diff:

  • Renamed reports to incrementalReportsQueue to better reflect the functionality of the map.
  • Added acceptedIncrementalReportTypeSet for rigorous report type checking.
  • StateContext#addReport now explicitly throws, instead of hiding any possible NPEs. Rationale: As I think again, those reports are all internally generated rather than dependent on direct user input. It's better to catch any problem early (if any).
  • StateContext#addReport also strictly checks for report type now. This intends to force any future additions of report types to be explicitly allowed. Only 5 report types (names) are expected:
hadoop.hdds.NodeReportProto
hadoop.hdds.ContainerReportsProto
hadoop.hdds.IncrementalContainerReportProto
hadoop.hdds.CommandStatusReportsProto
hadoop.hdds.PipelineReportsProto
  • StateContext#putBackReports now checks ALL reports in the list to be put back. I'd expect the list to be small. And since the logic is run on each DataNode, it shouldn't become a performance bottleneck. Also added debug logging just in case we need to diagnose it in the future.
  • Added new UT testPutBackReports and testReportQueueWithAddReports to check that putBackReports and getReports behaves.

Let me know if you think I missed anything, or it might be a bit over overkill.

@runzhiwang
Copy link
Contributor

Thanks for the review @runzhiwang . In a previous revision I had a strict check in StateContext#addReport:
22846f4#diff-4e474a0cbb28091ed4fab8ee80878c3acf7e15711fba2a0d83bacd56ea8f5707L237-L240
which ensures only CommandStatusReports and IncrementalContainerReport would ever be added to reports queue. Later this is removed since @avijayanhwx suggested we don't really need that.
The only two paths where reports would be queued are addReports and StateContext#putBackReports, where HeartbeatEndpointTask#putBackReports makes sure we only put back CommandStatusReports and IncrementalContainerReport.
Would you elaborate on the integration test plan you have on your mind?

@smengcl OOM is a serious problem, so we need strong test. If somebody change the existed two path, or add the third path, and put ContainerReport in the queue, OOM will happen, so we need some restrain. The integration test maybe only can cover the existed two path, for example, you can decrease the time interval of ReportPublisher#call and increase the time interval of EndpointStateMachine.EndPointStates call(), so that ContainerReport generated many times but has not been send out, then we check the reports do not contain ContainerReport.

Thanks @runzhiwang .

I think a better and easier way to check this behavior is to add a new UT that calls addReports and putBackReports repetitively, then check the reports queue length. Should only expect the queue grow when report type is CommandStatusReport or IncrementalContainerReport, but not any other ones. What do you think?

@smengcl I agree.

Change-Id: I44aba0ff093fa0ff5f9a35a74ff4126d9c73dd7b
Change-Id: I545ca8c636b8bbd1239a936603b6efcbfbd10b88
Change-Id: I5cdc38a297cf3d91bad8335eb36f2ba6b2009c10
private final Set<InetSocketAddress> endpoints;
private final Map<InetSocketAddress, List<GeneratedMessage>> reports;
// Only the latest full report of each type is kept
private final AtomicReference<GeneratedMessage> containerReports;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may make them to be volatile, which is simpler but thread safe as well,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip. I will keep atomic for now as the report is not updated so frequently on DNs (low performance impact).

…ntainer/Node/Pipeline reports.

Thanks Aravindan.

Change-Id: Ic39f53de8bb91a179bd5dc3301aa89ffecf5c076
Copy link
Contributor

@runzhiwang runzhiwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@runzhiwang runzhiwang closed this Nov 30, 2020
@runzhiwang runzhiwang reopened this Nov 30, 2020
@runzhiwang
Copy link
Contributor

Reopen pr to trigger CI

@runzhiwang runzhiwang closed this Dec 1, 2020
@runzhiwang runzhiwang reopened this Dec 1, 2020
@runzhiwang runzhiwang closed this Dec 1, 2020
@runzhiwang runzhiwang reopened this Dec 1, 2020
@runzhiwang
Copy link
Contributor

Waiting CI pass.

Copy link
Contributor

@avijayanhwx avijayanhwx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @smengcl. LGTM +1 pending CI.

@smengcl
Copy link
Contributor Author

smengcl commented Dec 1, 2020

Thanks @runzhiwang @avijayanhwx @GlenGeng for the review and comments.

CI still has unrelated failures (TestOzoneManagerHAWithData#testOMRestart and acceptance.ozone-om-ha-s3). I will retrigger one more time.

Change-Id: I46fb65ce2c81cb607dd10a9ee829d829ec7a8cfc
Change-Id: I9d3b5ce0b81d457344ecd6b4ad6e141169bd89d3
Change-Id: I4e9dace08152e21c2bb5da564487d776a84c4ade
Change-Id: Iea89129865df35ca54d85307bac49ede5da45ea7
…stStorageContainerManager#testBlockDeletionTransactions.

Change-Id: I16c8bfcfbd619a23241ea6dda9b95265714984bb
@smengcl
Copy link
Contributor Author

smengcl commented Dec 9, 2020

TestStorageContainerManager#testBlockDeletionTransactions was failing for a legitimate reason.
The small issue is addressed by simply returning (instead of throwing) when null is passed into StateContext#addReport.

Change-Id: I64203ebbaeebdb4c598bb8dad578579c82bf604b
@smengcl
Copy link
Contributor Author

smengcl commented Dec 10, 2020

TestDecommissionAndMaintenance#testNodeWithOpenPipelineCanBeDecommissioned passed locally for me. Retriggering CI.

@runzhiwang runzhiwang merged commit 0152eb5 into apache:master Dec 14, 2020
@runzhiwang
Copy link
Contributor

@smengcl Thanks the patch. @adoroszlai @avijayanhwx @GlenGeng Thanks for review, I have merged the patch.

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jan 18, 2021
…w in processing reports (apache#1601)

* HDDS-4404. Datanode can go OOM when a Recon or SCM Server is very slow in processing reports.

Change-Id: Idf6394d0bde2fb1c9fe99a96c49fa28677e0e527

* Checkstyle.

Change-Id: I6c4729634905bf6bba5bb6483c393d3c7511a78e

* Add mock-maker-inline so UT can mock final classes.

Change-Id: I63470510ed2089438f08f4d3cea321c1b5f01c3c

* Remove throw exception in addReport.

Change-Id: Id3ff6222ed279568d32876afc0a6a5ee1316c926

* Fix existing UT TestStateContext#testReportAPIs.

Change-Id: I25020b4411f4163a2ec134431793bfe663e5f6bf

* Remove unused imports.

Change-Id: I2dc93c42fc7468b37d41a179ca047ee385fc8896

* Work around mockito bug in UT TestCreatePipelineCommandHandler: caused by adding mock-maker-inline.

Change-Id: I0f21c508af1f7b36d7632aaa3b9173a17056f211

* Add UT testContainerNodePipelineReportAPIs.

Change-Id: I8d67a1b6e2d9043fb1479e2547d2e690932f7985

* Clean up.

Change-Id: Iaec544ba6fb211d0d71fe03b26b0fc70199dc882

* @VisibleForTesting for get container/node/pipeline reports as they are only used in UT, at least at the moment.

Change-Id: I72070fde8df220ecebf2fb6d1015d19f65cee11e

* - Renamed `reports` to `incrementalReportsQueue` to better reflect the functionality of the map.
- Added `acceptedIncrementalReportTypeSet` for rigorous report type checking.
- `StateContext#addReport` now explicitly throws, instead of hiding any possible NPEs. Rationale: As I think again, those reports are all internally generated rather than dependent on direct user input. It's better to catch any problem early (if any).
- `StateContext#addReport` also strictly checks for report type now. This intends to force any future additions of report types to be explicitly allowed. Only 5 report types (names) are expected:
```
hadoop.hdds.NodeReportProto
hadoop.hdds.ContainerReportsProto
hadoop.hdds.IncrementalContainerReportProto
hadoop.hdds.CommandStatusReportsProto
hadoop.hdds.PipelineReportsProto
```
- `StateContext#putBackReports` now checks ALL reports in the list to be put back. I'd expect the list to be small. And since the logic is run on each DataNode, it shouldn't become a performance bottleneck. Also added debug logging just in case we need to diagnose it in the future.
- Added new UT `testPutBackReports` and `testReportQueueWithAddReports` to check that `putBackReports` and `getReports` behaves.

Change-Id: I48aca3d4a422d52532b39c9462a30519c9cbe0eb

* Make containerReports, nodeReport, pipelineReports atomic and final.

Change-Id: I44aba0ff093fa0ff5f9a35a74ff4126d9c73dd7b

* Checkstyle.

Change-Id: I545ca8c636b8bbd1239a936603b6efcbfbd10b88

* Clean up.

Change-Id: I5cdc38a297cf3d91bad8335eb36f2ba6b2009c10

* Remove assertion in putBackReports as requestBuilder might include Container/Node/Pipeline reports.

Thanks Aravindan.

Change-Id: Ic39f53de8bb91a179bd5dc3301aa89ffecf5c076

* Empty commit to retrigger CI.

Change-Id: I46fb65ce2c81cb607dd10a9ee829d829ec7a8cfc

* Empty commit to retrigger CI.

Change-Id: I9d3b5ce0b81d457344ecd6b4ad6e141169bd89d3

* Retrigger.

Change-Id: I4e9dace08152e21c2bb5da564487d776a84c4ade

* Allow null as input to StateContext#addReport() as this blocked UT TestStorageContainerManager#testBlockDeletionTransactions.

Change-Id: I16c8bfcfbd619a23241ea6dda9b95265714984bb

* Retrigger CI

Change-Id: I64203ebbaeebdb4c598bb8dad578579c82bf604b

* trigger new CI check

Co-authored-by: Doroszlai, Attila <[email protected]>
(cherry picked from commit 0152eb5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants