-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-4754. Make scm heartbeat rpc retry interval configurable #1942
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Please take a look @GlenGeng |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just reuse default DN heartbeat interval(HddsConfigKeys#HDDS_HEARTBEAT_INTERVAL_DEFAULT, 30s) rather than defined a new rpc retry interval here? Would this a better way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just reuse default DN heartbeat interval(HddsConfigKeys#HDDS_HEARTBEAT_INTERVAL_DEFAULT, 30s) rather than defined a new rpc retry interval here? Would this a better way?
The retry interval is only 1 sec now, which is for quickly connecting the scm. The default HB interval may be too long.
Actually, the retry count is not working, since the DatanodeStateMachine keeps retrying after 15 retries finish.
The current retry policy seems still needs to be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, get it.
There is another place that also can be updated to use getScmRpcRetryInterval(conf) in this class. Can you update this (SCMConnectionManager.java#L200)?
/**
* Adds a new Recon server to the set of endpoints.
* @param address Recon address.
* @throws IOException
*/
public void addReconServer(InetSocketAddress address) throws IOException {
LOG.info("Adding Recon Server : {}", address.toString());
writeLock();
try {
if (scmMachines.containsKey(address)) {
LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
"Ignoring the request.");
return;
}
Configuration hadoopConfig =
LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
RPC.setProtocolEngine(hadoopConfig, ReconDatanodeProtocolPB.class,
ProtobufRpcEngine.class);
long version =
RPC.getProtocolVersion(ReconDatanodeProtocolPB.class);
RetryPolicy retryPolicy =
RetryPolicies.retryUpToMaximumCountWithFixedSleep(
getScmRpcRetryCount(conf),
1000, TimeUnit.MILLISECONDS); <======
...
}41b4405 to
39b1f5c
Compare
linyiqun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the comment, @Xushaohong .
LGTM, +1. Let's wait @GlenGeng to have a final review for this PR, : ).
|
+1 Thanks @Xushaohong for the work. Thanks @linyiqun for the review. This is a preliminary work for the SCM OOM issue. Future proposal will be throttling the on-going reports at both SCM side and DN side, e.g., 1) SCM drops the reports if it has queued too many reports, 2) DN reduces the number of reports by recording a lease for its Container(recommended by @xiaoyuyao ). |
|
@runzhiwang pls take a look and help merge :) |
runzhiwang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @Xushaohong Thanks the patch. @linyiqun @GlenGeng Thanks for review.
…ing-upgrade * upstream/master: (29 commits) HDDS-4741. Modularize upgrade test (apache#1928) HDDS-4864. Add acceptance tests to certify Ozone with boto3 python client. (apache#1976) HDDS-4791. StateContext.getReports may return list with size larger t… (apache#1892) HDDS-4867. Ozone admin datanode list should report dead and stale nodes (apache#1966) HDDS-4858. Useless Maven cache cleanup (apache#1956) HDDS-4769. Simplify insert operation of ContainerAttribute (apache#1865) HDDS-4847. Fix typo in name of IdentityService (apache#1941) HDDS-4869. Bump jackson version number (apache#1963) HDDS-4871. Fix intellij runConfigurations for datanode (apache#1968) HDDS-4870. Bump jetty version (apache#1964) HDDS-4722. Creating RDBStore fails due to RDBMetrics instance race (apache#1820) HDDS-4138. Improve crc efficiency by using Java.util.zip.CRC when available (apache#1950) HDDS-4816. Add UsageInfoSubcommand to get Datanode usage information. (apache#1919) HDDS-4754. Make scm heartbeat rpc retry interval configurable (apache#1942) HDDS-4832. Show Datanode OperationalState in Recon (apache#1937) HDDS-4653. Support TDE for MPU Keys on Encrypted Buckets (apache#1766) HDDS-4853. libexec/entrypoint.sh might copy from wrong path (apache#1951) HDDS-4857. Format ReplicationType.java which indentation are confusion (apache#1952) HDDS-4850. Intermittent failure in ozonesecure due to unable to allocate block (apache#1948) HDDS-4808. Add Genesis benchmark for various CRC implementations (apache#1910) ... Conflicts: hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/client/ScmClient.java hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocol.java hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConsts.java hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/protocolPB/StorageContainerLocationProtocolClientSideTranslatorPB.java hadoop-hdds/interface-admin/src/main/proto/ScmAdminProtocol.proto hadoop-hdds/interface-client/src/main/proto/hdds.proto hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocolServerSideTranslatorPB.java hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java hadoop-hdds/tools/src/main/java/org/apache/hadoop/hdds/scm/cli/ContainerOperationClient.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
…ing-upgrade-merge-candidate * upstream/master: (29 commits) HDDS-4741. Modularize upgrade test (apache#1928) HDDS-4864. Add acceptance tests to certify Ozone with boto3 python client. (apache#1976) HDDS-4791. StateContext.getReports may return list with size larger t… (apache#1892) HDDS-4867. Ozone admin datanode list should report dead and stale nodes (apache#1966) HDDS-4858. Useless Maven cache cleanup (apache#1956) HDDS-4769. Simplify insert operation of ContainerAttribute (apache#1865) HDDS-4847. Fix typo in name of IdentityService (apache#1941) HDDS-4869. Bump jackson version number (apache#1963) HDDS-4871. Fix intellij runConfigurations for datanode (apache#1968) HDDS-4870. Bump jetty version (apache#1964) HDDS-4722. Creating RDBStore fails due to RDBMetrics instance race (apache#1820) HDDS-4138. Improve crc efficiency by using Java.util.zip.CRC when available (apache#1950) HDDS-4816. Add UsageInfoSubcommand to get Datanode usage information. (apache#1919) HDDS-4754. Make scm heartbeat rpc retry interval configurable (apache#1942) HDDS-4832. Show Datanode OperationalState in Recon (apache#1937) HDDS-4653. Support TDE for MPU Keys on Encrypted Buckets (apache#1766) HDDS-4853. libexec/entrypoint.sh might copy from wrong path (apache#1951) HDDS-4857. Format ReplicationType.java which indentation are confusion (apache#1952) HDDS-4850. Intermittent failure in ozonesecure due to unable to allocate block (apache#1948) HDDS-4808. Add Genesis benchmark for various CRC implementations (apache#1910) ... Conflicts: hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/client/ScmClient.java hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocol.java hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConsts.java hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/protocolPB/StorageContainerLocationProtocolClientSideTranslatorPB.java hadoop-hdds/interface-admin/src/main/proto/ScmAdminProtocol.proto hadoop-hdds/interface-client/src/main/proto/hdds.proto hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocolServerSideTranslatorPB.java hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java hadoop-hdds/tools/src/main/java/org/apache/hadoop/hdds/scm/cli/ContainerOperationClient.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/scm/ReconNodeManager.java
* HDDS-3698-nonrolling-upgrade: (29 commits) HDDS-4741. Modularize upgrade test (apache#1928) HDDS-4864. Add acceptance tests to certify Ozone with boto3 python client. (apache#1976) HDDS-4791. StateContext.getReports may return list with size larger t… (apache#1892) HDDS-4867. Ozone admin datanode list should report dead and stale nodes (apache#1966) HDDS-4858. Useless Maven cache cleanup (apache#1956) HDDS-4769. Simplify insert operation of ContainerAttribute (apache#1865) HDDS-4847. Fix typo in name of IdentityService (apache#1941) HDDS-4869. Bump jackson version number (apache#1963) HDDS-4871. Fix intellij runConfigurations for datanode (apache#1968) HDDS-4870. Bump jetty version (apache#1964) HDDS-4722. Creating RDBStore fails due to RDBMetrics instance race (apache#1820) HDDS-4138. Improve crc efficiency by using Java.util.zip.CRC when available (apache#1950) HDDS-4816. Add UsageInfoSubcommand to get Datanode usage information. (apache#1919) HDDS-4754. Make scm heartbeat rpc retry interval configurable (apache#1942) HDDS-4832. Show Datanode OperationalState in Recon (apache#1937) HDDS-4653. Support TDE for MPU Keys on Encrypted Buckets (apache#1766) HDDS-4853. libexec/entrypoint.sh might copy from wrong path (apache#1951) HDDS-4857. Format ReplicationType.java which indentation are confusion (apache#1952) HDDS-4850. Intermittent failure in ozonesecure due to unable to allocate block (apache#1948) HDDS-4808. Add Genesis benchmark for various CRC implementations (apache#1910) ...
What changes were proposed in this pull request?
Background:
The current retry policy of DN is to retry sending with a 1s interval. Given at some time-point, all the DNs lost connection with the SCM at the same time, due to the restart of SCM, all DNs will send container report to SCM nearly at the same time, which is a ContainerReport Storm.
Solution:
Manually adjust the rpc-retry-interval with rpc-retry-count could mitigate extreme cases such as OOM, when facing up a huge cluster.
Make the rpc-retry-interval configurable.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-4754
How was this patch tested?
CI