
Conversation

@JacksonYao287
Contributor

What changes were proposed in this pull request?

While investigating an SCM OOM, I found that the datanode always reports full information about containers, pipelines, and nodes.
By default, the ContainerReportPublisher thread runs periodically in the datanode (HDDS_CONTAINER_REPORT_INTERVAL, default 60s), while the HeartbeatEndpointTask, which runs every hdds.heartbeat.interval, should only report the information in the incremental report queue.
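
For illustration, here is a minimal, self-contained sketch of the intended split (all class and method names here are hypothetical, not the actual patch): the full report is refreshed on its own schedule and consumed at most once, while every heartbeat drains the incremental queue.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicReference;

    // Hypothetical sketch, not the actual patch: the full report is refreshed
    // on its own schedule, while incremental reports are queued so none are
    // lost between heartbeats.
    public class ReportScheduleSketch {
      private final AtomicReference<String> fullReport = new AtomicReference<>();
      private final ConcurrentLinkedQueue<String> incrementalQueue =
          new ConcurrentLinkedQueue<>();

      // Called by the publisher thread every hdds.container.report.interval.
      void refreshFullReport(String report) {
        fullReport.set(report);
      }

      // Called when a container/pipeline action completes (the ICR path).
      void addIncrementalReport(String report) {
        incrementalQueue.add(report);
      }

      // Called by each HeartbeatEndpointTask run: the full report is consumed
      // at most once per publish, so heartbeats in between carry only the
      // queued increments.
      List<String> collectReportsForHeartbeat() {
        List<String> reports = new ArrayList<>();
        String full = fullReport.getAndSet(null);
        if (full != null) {
          reports.add(full);
        }
        String incremental;
        while ((incremental = incrementalQueue.poll()) != null) {
          reports.add(incremental);
        }
        return reports;
      }
    }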

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5111

How was this patch tested?

unit test

@JacksonYao287 JacksonYao287 force-pushed the HDDS-5111 branch 2 times, most recently from ef466a2 to 11a2abe Compare April 25, 2021 03:32
@JacksonYao287 JacksonYao287 reopened this Apr 25, 2021
@JacksonYao287 JacksonYao287 force-pushed the HDDS-5111 branch 4 times, most recently from c88378d to f381601 Compare April 26, 2021 10:00
@JacksonYao287 JacksonYao287 reopened this Apr 26, 2021
@JacksonYao287 JacksonYao287 force-pushed the HDDS-5111 branch 2 times, most recently from 26180fd to b43325f Compare April 26, 2021 12:54
@adoroszlai adoroszlai changed the title HDDS-5111.DataNode should not always report full information in heartbeat HDDS-5111. DataNode should not always report full information in heartbeat Apr 27, 2021
@JacksonYao287 JacksonYao287 force-pushed the HDDS-5111 branch 3 times, most recently from 013ae29 to 3203e45 Compare May 6, 2021 11:55
@JacksonYao287 JacksonYao287 reopened this May 7, 2021
@JacksonYao287
Contributor Author

JacksonYao287 commented May 7, 2021

@avijayanhwx @symious @adoroszlai please take a look, thanks!

@JacksonYao287
Contributor Author

@GlenGeng @bshashikant can you help review this PR? Thanks!

@bshashikant
Contributor

Thanks @JacksonYao287 for the patch. I am not very sure whether we should do this. Any incremental change on the datanode with respect to pipelines/containers is initiated by either pipeline or container actions. A report ideally means a full status report from the datanode to SCM. If the reports are sent incrementally, there can be cases where SCM marks container replicas missing and triggers re-replication.

I think it needs some more thought and discussion to arrive at a conclusion on this. cc @nandakumar131, @mukul1987

@bshashikant bshashikant reopened this May 19, 2021
@JacksonYao287
Contributor Author

JacksonYao287 commented May 19, 2021

@bshashikant, thanks for the review!
For now, there are two kinds of report: IncrementalReport, which is put into the incremental message queue, and FullReport, which is refreshed periodically.
For IncrementalReport: when a container/pipeline takes a new action, it wants SCM to know the action is done as soon as possible, so the report is sent through sendICR and a new heartbeat is triggered proactively. SCM then learns of the action, updates the corresponding container/pipeline state as fast as possible, and takes the next step (e.g., replying to OM).
For FullReport: these are published periodically, controlled by hdds.node.report.interval, hdds.container.report.interval, and hdds.pipeline.report.interval. If these intervals are longer than the heartbeat interval, the full reports should not appear in every heartbeat.
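
As a hedged illustration (the values below are made up for the example, not defaults), these intervals are ordinary time-duration settings that can be tuned independently of the heartbeat:

    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.hdds.conf.OzoneConfiguration;

    public class ReportIntervalExample {
      public static void main(String[] args) {
        OzoneConfiguration conf = new OzoneConfiguration();
        // Illustrative values only: full reports are published far less often
        // than heartbeats, so most heartbeats should carry no full report.
        conf.setTimeDuration("hdds.heartbeat.interval", 30, TimeUnit.SECONDS);
        conf.setTimeDuration("hdds.node.report.interval", 60, TimeUnit.SECONDS);
        conf.setTimeDuration("hdds.container.report.interval", 60, TimeUnit.SECONDS);
        conf.setTimeDuration("hdds.pipeline.report.interval", 60, TimeUnit.SECONDS);
      }
    }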

A report ideally means a full status report from the datanode to SCM

So I think a report may not always mean a full report. In my test environment there are about a million containers; if SCM always receives the full container report, I think it is a heavy burden on the network, SCM, and the datanode.

For now, there are two problems:

1. According to the current implementation, the heartbeat always carries the full report, even when only an incremental report needs to be sent; thus hdds.node.report.interval, hdds.container.report.interval, and hdds.pipeline.report.interval take no effect. I think this is a bug, and this patch fixes it.

2. The current logic of StateContext#addReport is a little confusing, so I refactored it, splitting it into two functions so that the logic is now clear.

 private void sendPipelineReport() {
    if (context != null) {
      // TODO: Send IncrementalPipelineReport instead of full PipelineReport
      context.addReport(context.getParent().getContainer().getPipelineReport());
      context.getParent().triggerHeartbeat();
    }
  }

After refactoring, we can now send an IncrementalPipelineReport instead of the full PipelineReport.
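
A sketch of what that call site could look like after the split (addIncrementalReport and getIncrementalPipelineReport are illustrative names implied by the description above, not necessarily the actual API):

     private void sendPipelineReport() {
        if (context != null) {
          // Hypothetical: queue only the changed pipelines as an incremental
          // report, instead of refreshing the full PipelineReport each time.
          context.addIncrementalReport(
              context.getParent().getContainer().getIncrementalPipelineReport());
          context.getParent().triggerHeartbeat();
        }
      }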

@bshashikant
Contributor

@JacksonYao287, thanks for the explanation.
  public List<GeneratedMessage> getReports(InetSocketAddress endpoint,
      int maxLimit) {
    if (maxLimit < 0) {
      throw new IllegalArgumentException("Illegal maxLimit value: " + maxLimit);
    }
    List<GeneratedMessage> reports = getNonIncrementalReports(endpoint);
    if (maxLimit <= reports.size()) {
      return reports.subList(0, maxLimit);
    } else {
      reports.addAll(getIncrementalReports(endpoint, maxLimit - reports.size()));
      return reports;
    }
  }

The code here suggests it still tries to send the entire report if it is below the configured max limit, and otherwise sends the incremental report. This will be sent in each HeartbeatEndpointTask. Is my understanding correct?

@bshashikant
Contributor

bshashikant commented May 20, 2021

@avijayanhwx, can you have a look at this? This is related to the report-handling changes done for Recon, as well as those done for https://issues.apache.org/jira/browse/HDDS-4404.

@JacksonYao287
Contributor Author

Thanks @bshashikant!

The code here suggests it still tries to send the entire report if it is below the configured max limit, and otherwise sends the incremental report. This will be sent in each HeartbeatEndpointTask. Is my understanding correct?

  public List<GeneratedMessage> getAllAvailableReports(
      InetSocketAddress endpoint) {
    return getReports(endpoint, Integer.MAX_VALUE);
  }

For now, the code above is the only place where StateContext#getReports is called, so maxLimit is always Integer.MAX_VALUE. In practice, maxLimit has no effect, and the full report is sent in every HeartbeatEndpointTask.

By the way, I think we should not limit the total size of a heartbeat report. What we should do is make sure the datanode always sends exactly what it should send, which is controlled by the parameters in the configuration file.

@avijayanhwx PTAL

@avijayanhwx
Contributor

Thanks for working on this @JacksonYao287. I understand the approach being implemented here, and the solution looks fine to me. In your setup, is the OOM caused by multiple reports being bottlenecked at the event queue layer?

cc @smengcl for review as well.

@JacksonYao287
Contributor Author

JacksonYao287 commented Jun 2, 2021

Thanks @avijayanhwx for the review!

In your setup, is the OOM caused due to multiple reports being bottlenecked at the event queue layer?

Yes. When there is a large number of containers (in my cluster, about 1 million containers; the heartbeat interval is set to 2s and the container size to 128MB), and the datanode always sends the full report, the single-threaded report handler cannot keep up. More and more container reports pile up waiting to be handled, and this leads to the SCM OOM.
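
For a rough sense of scale (the per-entry size below is an assumption for illustration, not a measured figure): at roughly 100 bytes per container entry, a full report for 1,000,000 containers is on the order of 100 MB, regenerated for every 2s heartbeat. A single-threaded handler cannot drain reports at that rate, so the backlog keeps growing until the heap is exhausted.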

@smengcl can you help review this PR?

Contributor

@smengcl smengcl left a comment


Thanks @JacksonYao287 . The overall logic makes sense to me from reading it.
This patch seeks to eliminate repeatedly sending the same full report to the same endpoint, which was indeed a problem before this patch.

Overall lgtm. Minor nits inline.

btw have you done any external / manual testing on this?

…ozone/container/common/statemachine/StateContext.java


add Code Comments

Co-authored-by: Siyao Meng <50227127+smengcl@users.noreply.github.com>
@JacksonYao287
Contributor Author

Thanks @smengcl very much for the review!

btw have you done any external / manual testing on this?

I have applied this patch to my own code, and it has been running in my test cluster for several weeks. So far it seems to work well.

Jackson Yao and others added 2 commits June 3, 2021 16:10
…ozone/container/common/statemachine/StateContext.java

Co-authored-by: Siyao Meng <50227127+smengcl@users.noreply.github.com>
…ozone/container/common/statemachine/StateContext.java

Co-authored-by: Siyao Meng <50227127+smengcl@users.noreply.github.com>
Jackson Yao added 4 commits June 3, 2021 16:49
@JacksonYao287
Contributor Author

Updated!
@avijayanhwx @smengcl PTAL, thanks

Contributor

@avijayanhwx avijayanhwx left a comment


LGTM +1

@JacksonYao287
Contributor Author

Thanks for the review @bshashikant @avijayanhwx @smengcl !
Can we merge this PR now?

Jackson Yao and others added 2 commits June 5, 2021 18:21
…ozone/container/common/statemachine/StateContext.java


check before AtomicReference#get

Co-authored-by: Siyao Meng <50227127+smengcl@users.noreply.github.com>
@smengcl smengcl merged commit 39954ad into apache:master Jun 7, 2021
@smengcl
Contributor

smengcl commented Jun 7, 2021

Thanks @JacksonYao287 for the contribution!

@JacksonYao287 JacksonYao287 deleted the HDDS-5111 branch June 8, 2021 01:42