HDDS-4754. Make scm heartbeat rpc retry interval configurable #1942

Xushaohong · 2021-02-19T07:33:12Z

What changes were proposed in this pull request?

Background:
The DN currently retries failed RPCs to SCM at a fixed 1s interval. If all DNs lose their connection to SCM at the same time, for example because SCM restarts, they all send container reports to SCM at nearly the same time, which causes a ContainerReport storm.

Solution:
Make the rpc-retry-interval configurable.
On a huge cluster, manually tuning the rpc-retry-interval together with the rpc-retry-count could mitigate extreme cases such as OOM.
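
For illustration only (not part of the patch): a minimal sketch of why the interval matters on a large cluster. The class, the `rpc-retry-interval.ms` property name and the plain `Map` used as configuration are hypothetical stand-ins, not the Ozone configuration API. It assumes every DN retries once per interval while SCM is unreachable.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the datanode retry settings; not the Ozone API. */
final class RetryIntervalSketch {
  // Assumed property name, used for illustration only.
  static final String RETRY_INTERVAL_MS_KEY = "rpc-retry-interval.ms";
  // The 1s interval that was hard-coded before this change.
  static final long DEFAULT_RETRY_INTERVAL_MS = 1000L;

  /** Reads the retry interval in milliseconds, falling back to the 1s default. */
  static long retryIntervalMs(Map<String, String> conf) {
    String raw = conf.get(RETRY_INTERVAL_MS_KEY);
    return raw == null ? DEFAULT_RETRY_INTERVAL_MS : Long.parseLong(raw.trim());
  }

  /**
   * Retry attempts per second reaching SCM if every datanode
   * retries exactly once per interval.
   */
  static double attemptsPerSecond(int datanodes, long intervalMs) {
    if (intervalMs <= 0) {
      throw new IllegalArgumentException("interval must be positive");
    }
    return datanodes * 1000.0 / intervalMs;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    // Default interval: each of the 1000 datanodes retries once per second.
    System.out.println(attemptsPerSecond(1000, retryIntervalMs(conf)));
    // Operator raises the interval to 5s on a large cluster.
    conf.put(RETRY_INTERVAL_MS_KEY, "5000");
    System.out.println(attemptsPerSecond(1000, retryIntervalMs(conf)));
  }
}
```

A longer interval lowers the rate of attempts; it does not by itself de-synchronise the DNs, which is why it is described here as a mitigation.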

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4754

How was this patch tested?

CI

Xushaohong · 2021-02-19T08:42:02Z

Please take a look @GlenGeng

linyiqun · 2021-02-20T07:48:11Z

...rc/main/java/org/apache/hadoop/ozone/container/common/statemachine/SCMConnectionManager.java

Can we just reuse the default DN heartbeat interval (HddsConfigKeys#HDDS_HEARTBEAT_INTERVAL_DEFAULT, 30s) rather than defining a new RPC retry interval here? Would this be a better way?

The retry interval is only 1s now so that the DN can reconnect to SCM quickly. The default HB interval may be too long.
Also, the retry count does not actually take effect, since the DatanodeStateMachine keeps retrying after the 15 retries finish.
The current retry policy still seems to need changing.

Okay, got it.
There is another place in this class that can also be updated to use getScmRpcRetryInterval(conf). Can you update it (SCMConnectionManager.java#L200)?

    /**
     * Adds a new Recon server to the set of endpoints.
     * @param address Recon address.
     * @throws IOException
     */
    public void addReconServer(InetSocketAddress address) throws IOException {
      LOG.info("Adding Recon Server : {}", address.toString());
      writeLock();
      try {
        if (scmMachines.containsKey(address)) {
          LOG.warn("Trying to add an existing SCM Machine to Machines group. "
              + "Ignoring the request.");
          return;
        }
        Configuration hadoopConfig =
            LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
        RPC.setProtocolEngine(hadoopConfig, ReconDatanodeProtocolPB.class,
            ProtobufRpcEngine.class);
        long version = RPC.getProtocolVersion(ReconDatanodeProtocolPB.class);
        RetryPolicy retryPolicy =
            RetryPolicies.retryUpToMaximumCountWithFixedSleep(
                getScmRpcRetryCount(conf), 1000, TimeUnit.MILLISECONDS); <======
        ...
    }
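
To make the behaviour under discussion concrete, here is a simplified, self-contained sketch of a fixed-sleep retry in which the sleep time is a parameter rather than a hard-coded 1000 ms. `FixedSleepRetrySketch` is a hypothetical stand-in written for this illustration; it is not Hadoop's `RetryPolicy` and does not reproduce the patch.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

/** Simplified stand-in for a fixed-sleep retry policy; not Hadoop's RetryPolicy. */
final class FixedSleepRetrySketch {
  private final int maxRetries;
  private final long sleepMillis;

  FixedSleepRetrySketch(int maxRetries, long sleepTime, TimeUnit unit) {
    if (maxRetries < 0 || sleepTime < 0) {
      throw new IllegalArgumentException("retries and sleep time must be >= 0");
    }
    this.maxRetries = maxRetries;
    this.sleepMillis = unit.toMillis(sleepTime);
  }

  /** Longest time spent sleeping before this policy gives up. */
  long maxTotalSleepMillis() {
    return maxRetries * sleepMillis;
  }

  /** Calls rpc, sleeping a fixed time after each failure, up to maxRetries retries. */
  <T> T call(Callable<T> rpc) throws Exception {
    int failures = 0;
    while (true) {
      try {
        return rpc.call();
      } catch (Exception e) {
        if (failures >= maxRetries) {
          throw e; // retries exhausted, surface the last failure
        }
        failures++;
        Thread.sleep(sleepMillis);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Interval passed in by the caller, e.g. read from configuration.
    FixedSleepRetrySketch policy =
        new FixedSleepRetrySketch(15, 10, TimeUnit.MILLISECONDS);
    int[] attempts = {0};
    String result = policy.call(() -> {
      if (++attempts[0] < 3) {
        throw new IOException("SCM not reachable yet");
      }
      return "connected";
    });
    System.out.println(result + " after " + attempts[0] + " attempts");
  }
}
```

With a policy of this shape, a caller that restarts the whole call after the retries are exhausted will keep retrying indefinitely, so the count alone does not bound the load.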

linyiqun

Thanks for addressing the comment, @Xushaohong .
LGTM, +1. Let's wait for @GlenGeng to do a final review of this PR. :)

GlenGeng-awx · 2021-02-22T03:30:06Z

+1

Thanks @Xushaohong for the work. Thanks @linyiqun for the review.

This is preliminary work for the SCM OOM issue. A future proposal is to throttle the ongoing reports on both the SCM side and the DN side, e.g., 1) SCM drops reports if it has queued too many, 2) the DN reduces the number of reports by recording a lease for its containers (recommended by @xiaoyuyao).
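
A rough sketch of idea 1), assuming a bounded in-memory queue on the SCM side. Nothing like this is part of this PR; `BoundedReportQueueSketch` and the use of `String` as the report type are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical bounded report queue that drops reports when full. */
final class BoundedReportQueueSketch {
  private final BlockingQueue<String> queue;
  private long dropped;

  BoundedReportQueueSketch(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  /** Queues the report, or drops it and returns false if the queue is full. */
  synchronized boolean offer(String report) {
    boolean accepted = queue.offer(report); // non-blocking; false when full
    if (!accepted) {
      dropped++;
    }
    return accepted;
  }

  /** Removes and returns the oldest queued report, or null if none is queued. */
  String poll() {
    return queue.poll();
  }

  synchronized long droppedCount() {
    return dropped;
  }

  public static void main(String[] args) {
    BoundedReportQueueSketch reports = new BoundedReportQueueSketch(2);
    for (int dn = 1; dn <= 3; dn++) {
      boolean accepted = reports.offer("container-report-from-dn-" + dn);
      System.out.println("dn-" + dn + " accepted=" + accepted);
    }
    System.out.println("dropped=" + reports.droppedCount());
  }
}
```

Dropping at the queue bounds memory use, at the cost that a dropped report only arrives again when the DN next sends it.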

Xushaohong · 2021-02-22T03:36:27Z

@runzhiwang pls take a look and help merge :)

runzhiwang

LGTM. @Xushaohong Thanks for the patch. @linyiqun @GlenGeng Thanks for the review.

…ing-upgrade * upstream/master: (29 commits) HDDS-4741. Modularize upgrade test (apache#1928) HDDS-4864. Add acceptance tests to certify Ozone with boto3 python client. (apache#1976) HDDS-4791. StateContext.getReports may return list with size larger t… (apache#1892) HDDS-4867. Ozone admin datanode list should report dead and stale nodes (apache#1966) HDDS-4858. Useless Maven cache cleanup (apache#1956) HDDS-4769. Simplify insert operation of ContainerAttribute (apache#1865) HDDS-4847. Fix typo in name of IdentityService (apache#1941) HDDS-4869. Bump jackson version number (apache#1963) HDDS-4871. Fix intellij runConfigurations for datanode (apache#1968) HDDS-4870. Bump jetty version (apache#1964) HDDS-4722. Creating RDBStore fails due to RDBMetrics instance race (apache#1820) HDDS-4138. Improve crc efficiency by using Java.util.zip.CRC when available (apache#1950) HDDS-4816. Add UsageInfoSubcommand to get Datanode usage information. (apache#1919) HDDS-4754. Make scm heartbeat rpc retry interval configurable (apache#1942) HDDS-4832. Show Datanode OperationalState in Recon (apache#1937) HDDS-4653. Support TDE for MPU Keys on Encrypted Buckets (apache#1766) HDDS-4853. libexec/entrypoint.sh might copy from wrong path (apache#1951) HDDS-4857. Format ReplicationType.java which indentation are confusion (apache#1952) HDDS-4850. Intermittent failure in ozonesecure due to unable to allocate block (apache#1948) HDDS-4808. Add Genesis benchmark for various CRC implementations (apache#1910) ... 
Conflicts: hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/client/ScmClient.java hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocol.java hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConsts.java hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/protocolPB/StorageContainerLocationProtocolClientSideTranslatorPB.java hadoop-hdds/interface-admin/src/main/proto/ScmAdminProtocol.proto hadoop-hdds/interface-client/src/main/proto/hdds.proto hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocolServerSideTranslatorPB.java hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java hadoop-hdds/tools/src/main/java/org/apache/hadoop/hdds/scm/cli/ContainerOperationClient.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java

…ing-upgrade-merge-candidate * upstream/master: (29 commits) HDDS-4741. Modularize upgrade test (apache#1928) HDDS-4864. Add acceptance tests to certify Ozone with boto3 python client. (apache#1976) HDDS-4791. StateContext.getReports may return list with size larger t… (apache#1892) HDDS-4867. Ozone admin datanode list should report dead and stale nodes (apache#1966) HDDS-4858. Useless Maven cache cleanup (apache#1956) HDDS-4769. Simplify insert operation of ContainerAttribute (apache#1865) HDDS-4847. Fix typo in name of IdentityService (apache#1941) HDDS-4869. Bump jackson version number (apache#1963) HDDS-4871. Fix intellij runConfigurations for datanode (apache#1968) HDDS-4870. Bump jetty version (apache#1964) HDDS-4722. Creating RDBStore fails due to RDBMetrics instance race (apache#1820) HDDS-4138. Improve crc efficiency by using Java.util.zip.CRC when available (apache#1950) HDDS-4816. Add UsageInfoSubcommand to get Datanode usage information. (apache#1919) HDDS-4754. Make scm heartbeat rpc retry interval configurable (apache#1942) HDDS-4832. Show Datanode OperationalState in Recon (apache#1937) HDDS-4653. Support TDE for MPU Keys on Encrypted Buckets (apache#1766) HDDS-4853. libexec/entrypoint.sh might copy from wrong path (apache#1951) HDDS-4857. Format ReplicationType.java which indentation are confusion (apache#1952) HDDS-4850. Intermittent failure in ozonesecure due to unable to allocate block (apache#1948) HDDS-4808. Add Genesis benchmark for various CRC implementations (apache#1910) ... 
Conflicts: hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/client/ScmClient.java hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocol.java hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConsts.java hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/protocolPB/StorageContainerLocationProtocolClientSideTranslatorPB.java hadoop-hdds/interface-admin/src/main/proto/ScmAdminProtocol.proto hadoop-hdds/interface-client/src/main/proto/hdds.proto hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocolServerSideTranslatorPB.java hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java hadoop-hdds/tools/src/main/java/org/apache/hadoop/hdds/scm/cli/ContainerOperationClient.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/scm/ReconNodeManager.java

* HDDS-3698-nonrolling-upgrade: (29 commits) HDDS-4741. Modularize upgrade test (apache#1928) HDDS-4864. Add acceptance tests to certify Ozone with boto3 python client. (apache#1976) HDDS-4791. StateContext.getReports may return list with size larger t… (apache#1892) HDDS-4867. Ozone admin datanode list should report dead and stale nodes (apache#1966) HDDS-4858. Useless Maven cache cleanup (apache#1956) HDDS-4769. Simplify insert operation of ContainerAttribute (apache#1865) HDDS-4847. Fix typo in name of IdentityService (apache#1941) HDDS-4869. Bump jackson version number (apache#1963) HDDS-4871. Fix intellij runConfigurations for datanode (apache#1968) HDDS-4870. Bump jetty version (apache#1964) HDDS-4722. Creating RDBStore fails due to RDBMetrics instance race (apache#1820) HDDS-4138. Improve crc efficiency by using Java.util.zip.CRC when available (apache#1950) HDDS-4816. Add UsageInfoSubcommand to get Datanode usage information. (apache#1919) HDDS-4754. Make scm heartbeat rpc retry interval configurable (apache#1942) HDDS-4832. Show Datanode OperationalState in Recon (apache#1937) HDDS-4653. Support TDE for MPU Keys on Encrypted Buckets (apache#1766) HDDS-4853. libexec/entrypoint.sh might copy from wrong path (apache#1951) HDDS-4857. Format ReplicationType.java which indentation are confusion (apache#1952) HDDS-4850. Intermittent failure in ozonesecure due to unable to allocate block (apache#1948) HDDS-4808. Add Genesis benchmark for various CRC implementations (apache#1910) ...

linyiqun reviewed Feb 20, 2021

HDDS-4754. Make scm heartbeat rpc retry interval configurable

39b1f5c

Xushaohong force-pushed the HDDS-4754 branch from 41b4405 to 39b1f5c Compare February 20, 2021 11:08

linyiqun approved these changes Feb 20, 2021

Xushaohong closed this Feb 22, 2021

Xushaohong reopened this Feb 22, 2021

runzhiwang approved these changes Feb 25, 2021

runzhiwang merged commit de9884f into apache:master Feb 25, 2021
