HDDS-12378. Change default hdds.scm.safemode.min.datanode to 3 #8331

kostacie · 2025-04-24T14:45:56Z

What changes were proposed in this pull request?

This PR changes the value of hdds.scm.safemode.min.datanode to 3 in HddsConfigKeys.

Currently, the default value of hdds.scm.safemode.min.datanode is set to 1.
This means that a user can deploy an Ozone cluster (with the default configuration) with just one datanode and the cluster will come out of safemode. After coming out of safemode, writes with three-way replication will fail.

By changing the default value of hdds.scm.safemode.min.datanode from 1 to 3, we make sure that a cluster deployed with the default configuration does not fail on writes once it is out of safemode.

Users who want to run Ozone with just one datanode can manually update the config.
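The gap described above can be shown with a deliberately simplified model. This is a sketch, not Ozone code: the class and method names are hypothetical, and it looks only at the datanode-count rule, ignoring any other safemode rules.

```java
// Simplified model, not Ozone code: class and method names are hypothetical.
class SafeModeSketch {

  /** Datanode rule: satisfied once enough datanodes have registered. */
  static boolean datanodeRuleSatisfied(int registeredDatanodes, int minDatanodes) {
    return registeredDatanodes >= minDatanodes;
  }

  /** A write with the given replication factor needs at least that many datanodes. */
  static boolean canPlaceReplicas(int registeredDatanodes, int replicationFactor) {
    return registeredDatanodes >= replicationFactor;
  }

  public static void main(String[] args) {
    int registered = 1;   // single-datanode deployment
    int replication = 3;  // three-way replication
    for (int minDatanodes : new int[] {1, 3}) {
      boolean ruleOk = datanodeRuleSatisfied(registered, minDatanodes);
      boolean writeOk = canPlaceReplicas(registered, replication);
      System.out.println("min.datanode=" + minDatanodes
          + " datanodeRuleSatisfied=" + ruleOk
          + " threeWayWriteOk=" + writeOk);
    }
  }
}
```

With a minimum of 1, the rule is satisfied although a three-way write cannot be placed; with a minimum of 3, the rule holds the cluster back until enough datanodes have registered.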

What is the link to the Apache JIRA

HDDS-12378

How was this patch tested?

CI:
https://github.com/kostacie/ozone/actions/runs/14641406750

aryangupta1998

Thanks for the patch @kostacie. I would suggest we also log a warn message in the validate() method of DataNodeSafeModeRule.java. We can fetch all the DNs from the node manager based on health (HEALTHY, HEALTHY_READONLY, STALE); if the counts sum to 1, there is only one DN in the cluster, and then we should display a warn message to the user to update the "hdds.scm.safemode.min.datanode" to 1.

cc @nandakumar131 @errose28

sarvekshayr

Hi @kostacie.

Shouldn't we change the default value of the config in ozone-default.xml?

ozone/hadoop-hdds/common/src/main/resources/ozone-default.xml

Lines 1675 to 1682 in 5c5db8e

    
    <property>
      <name>hdds.scm.safemode.min.datanode</name>
      <value>1</value>
      <tag>HDDS,SCM,OPERATION</tag>
      <description>Minimum DataNodes which should be registered to get SCM out of
        safe mode.
      </description>
    </property>
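Why the XML default matters can be sketched as follows. This is an illustration using java.util.Properties rather than Hadoop's Configuration class, and it assumes the defaults file is loaded before the value is looked up; the class and helper names are hypothetical. In a lookup of the form "configured value, else code default", a value present in the defaults file takes precedence, so a stale 1 in the XML would mask a new default of 3 in the Java constant.

```java
import java.util.Properties;

// Illustration with java.util.Properties; Hadoop's Configuration is not used here.
class DefaultLookupSketch {
  static final String KEY = "hdds.scm.safemode.min.datanode";

  /** Returns the configured value if the key is present, otherwise the code default. */
  static int getInt(Properties props, String key, int codeDefault) {
    String raw = props.getProperty(key);
    return raw == null ? codeDefault : Integer.parseInt(raw.trim());
  }

  public static void main(String[] args) {
    int codeDefault = 3; // new default in the Java constant

    Properties xmlDefaults = new Properties();
    xmlDefaults.setProperty(KEY, "1"); // stale value still present in the XML defaults

    System.out.println(getInt(xmlDefaults, KEY, codeDefault));      // prints 1
    System.out.println(getInt(new Properties(), KEY, codeDefault)); // prints 3
  }
}
```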

myskov

Thanks @kostacie for the patch. I think we should also update the default value in docs like hadoop-hdds/docs/content/concept/StorageContainerManager.md

myskov · 2025-04-25T12:50:59Z

then we should display a warn message to the user to update the "hdds.scm.safemode.min.datanode" to 1.

I'm not sure we should ever suggest that users set hdds.scm.safemode.min.datanode to 1. Even the README guide suggests running 3 datanodes:

cd compose/ozone
docker-compose up -d --scale datanode=3

myskov · 2025-04-25T12:53:22Z

hadoop-hdds/docs/content/concept/StorageContainerManager.md

 ozone.scm.container.size | 5GB | Default container size used by Ozone
 ozone.scm.block.size | 256MB |  The default size of a data block.
-hdds.scm.safemode.min.datanode | 1 | Minimum number of datanodes to start the real work.
+hdds.scm.safemode.min.datanode | 3 | Minimum number of datanodes to start the real work.


There's also a Chinese version of the doc. Could you please make changes there too?

myskov · 2025-04-25T12:53:44Z

hadoop-hdds/docs/content/concept/StorageContainerManager.md

+hdds.scm.safemode.min.datanode | 3 | Minimum number of datanodes to start the real work.
 ozone.scm.http-address | 0.0.0.0:9876 | HTTP address of the SCM server
-ozone.metadata.dirs | none | Directory to store persisted data (RocksDB).
+ozone.metadata.dirs | none | Directory to store persisted data (RocksDB).


Please revert this irrelevant change.

For some reason I can't revert that as it changes itself and adds an extra line.

aryangupta1998 · 2025-04-25T12:57:09Z

I'm not sure we should ever suggest that users set hdds.scm.safemode.min.datanode to 1. Even the README guide suggests running 3 datanodes

I meant that we should only suggest it when there's only one DN: now that we are setting the default value to 3, if we have one DN, SCM will not come out of safemode until the user sets hdds.scm.safemode.min.datanode to 1.

adoroszlai · 2025-04-27T05:59:26Z

...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/DataNodeSafeModeRule.java

+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;


Please use org.slf4j.Logger.

adoroszlai · 2025-04-27T06:04:40Z

Basic Freon smoketest for one datanode                                | FAIL |
255 != 0
--
Single Node :: Smoketest for one datanode                             | FAIL |
1 test, 0 passed, 1 failed
--
ERROR: Test execution of ozonescripts/test.sh is FAILED!!!!

Need to set hdds.scm.safemode.min.datanode=1 in ozonescripts:

ozone/hadoop-ozone/dist/src/main/compose/ozonescripts/docker-config

Line 26 in cf1fb88

OZONE-SITE.XML_ozone.server.default.replication=1

Please also check unit test failures:

org.apache.hadoop.hdds.scm.safemode.TestHealthyPipelineSafeModeRule
org.apache.hadoop.hdds.scm.safemode.TestSCMSafeModeManager

adoroszlai · 2025-04-29T14:24:13Z

@aryangupta1998 @myskov would you like to take another look?

aryangupta1998

Thanks for the patch, @kostacie, and to @adoroszlai, @myskov, and @sarvekshayr for the reviews!

nandakumar131 · 2025-04-30T10:26:30Z

...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/DataNodeSafeModeRule.java

+    int healthyCount = nodeManager.getNodes(NodeStatus.inServiceHealthy()).size();
+    int healthyReadOnlyCount = nodeManager.getNodes(NodeStatus.inServiceHealthyReadOnly()).size();
+    int staleCount = nodeManager.getNodes(NodeStatus.inServiceStale()).size();
+
+    if (healthyCount + healthyReadOnlyCount + staleCount == 1) {
+      LOG.warn("Only one Datanode is available in the cluster. " +
+          "Consider setting 'hdds.scm.safemode.min.datanode=1' in the configuration.");
+    }


SCM doesn't remember the Datanode list across SCM restarts. So during an SCM restart, the datanode list in SCM is initially empty, and at some point the first Datanode registers, which makes (healthyCount + healthyReadOnlyCount + staleCount == 1) true and prints the warn log message.

We will end up printing this warn log message during every SCM start-up, which is incorrect.
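The start-up behaviour described above can be replayed in a small simulation. This is a sketch, not SCM code: the class and method names are hypothetical, and it assumes the check is evaluated once after each datanode registration.

```java
// Simulation only: names are hypothetical and no Ozone classes are used.
class StartupWarningSketch {

  /**
   * Replays datanode registrations after a restart, starting from an empty
   * node list, and counts how often the "exactly one datanode" check fires.
   */
  static int warningsDuringStartup(int datanodesInCluster) {
    int warnings = 0;
    int knownNodes = 0; // the node list is empty right after a restart
    for (int i = 0; i < datanodesInCluster; i++) {
      knownNodes++; // the next datanode registers
      if (knownNodes == 1) { // same condition as in the reviewed diff
        warnings++;
      }
    }
    return warnings;
  }

  public static void main(String[] args) {
    // A three-datanode cluster still passes through the "one datanode" state.
    System.out.println(warningsDuringStartup(3)); // prints 1
    System.out.println(warningsDuringStartup(1)); // prints 1
  }
}
```

Under these assumptions the check cannot tell a genuine single-datanode cluster from a larger cluster whose first datanode has just registered.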

Thank you for the review @nandakumar131. Should we add a flag that would indicate that the log message has already been printed? Would it be a good solution?
Or would you recommend something else to avoid this problem?
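For reference, a flag of the kind asked about could look like the sketch below. The class name is hypothetical and this is not code from the patch. It assumes the flag lives in memory only, so it is reset whenever the process restarts; under that assumption the message would be limited to once per start-up but would still appear on every start-up.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper, not part of the patch.
class WarnOnceSketch {
  private final AtomicBoolean warned = new AtomicBoolean(false);

  /** Returns true only for the first call on this instance. */
  boolean shouldWarn() {
    return warned.compareAndSet(false, true);
  }

  public static void main(String[] args) {
    WarnOnceSketch firstRun = new WarnOnceSketch();
    System.out.println(firstRun.shouldWarn()); // prints true
    System.out.println(firstRun.shouldWarn()); // prints false

    // A restart creates a fresh in-memory flag, so the warning comes back.
    WarnOnceSketch afterRestart = new WarnOnceSketch();
    System.out.println(afterRestart.shouldWarn()); // prints true
  }
}
```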

I guess we can remove the warning, unless @nandakumar131 has some other suggestion.

We don't need any logging for this.

adoroszlai

Thanks @kostacie for updating the patch.

@nandakumar131 would you like to take another look?

nandakumar131 · 2025-05-21T17:14:29Z

Thanks @kostacie for the contribution.
Thanks to @aryangupta1998, @myskov & @adoroszlai for the reviews.


Change default hdds.scm.safemode.min.datanode to 3

96d71a8

aryangupta1998 reviewed Apr 25, 2025


aryangupta1998 requested review from errose28 and nandakumar131 April 25, 2025 08:58

sarvekshayr reviewed Apr 25, 2025


myskov reviewed Apr 25, 2025


Add warn log, fix doc

d23c395

myskov reviewed Apr 25, 2025


Fix doc

f4b3218

adoroszlai reviewed Apr 27, 2025


Fix tests and config

0be4df6

adoroszlai requested review from aryangupta1998 and myskov April 29, 2025 08:28

aryangupta1998 approved these changes Apr 29, 2025


nandakumar131 requested changes Apr 30, 2025


kostacie added 3 commits May 21, 2025 11:26

Delete warn message, revert changes

f683f37

Clean up

3efcf26

Delete unused test, revert changes

e53e08d

adoroszlai approved these changes May 21, 2025


adoroszlai requested a review from nandakumar131 May 21, 2025 16:14

nandakumar131 approved these changes May 21, 2025


nandakumar131 merged commit 3bb8858 into apache:master May 21, 2025
43 checks passed

Tejaskriya pushed a commit to Tejaskriya/ozone that referenced this pull request May 29, 2025

HDDS-12378. Change default hdds.scm.safemode.min.datanode to 3 (apache#8331)

15e8be6

kostacie deleted the hdds-12378_changing_dafault_safemode_min_datanode_value branch June 19, 2025 08:49

