
Conversation

@sarvekshayr
Contributor

@sarvekshayr sarvekshayr commented Mar 4, 2025

What changes were proposed in this pull request?

#8391 adds support for invoking a callback once reconfiguration completes, allowing background services to be updated dynamically. A few more deletion-related configurations are now dynamically reconfigurable without requiring a restart.

OM:

  • ozone.thread.number.dir.deletion

DATANODE:

  • ozone.block.deleting.service.interval
  • ozone.block.deleting.service.timeout
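
To illustrate why a completion callback helps here, below is a minimal sketch, not the actual Ozone ReconfigurationHandler. The class ReconfigSketch and its methods register, onComplete and reconfigure are hypothetical names chosen for this example. The idea is that each property has its own handler, while a single callback runs once after all properties are applied, so a service that depends on several properties is restarted only once.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

/** Hypothetical sketch; names are illustrative and not Ozone's API. */
final class ReconfigSketch {
  // Per-property handlers: validate/apply a new value and return what was applied.
  private final Map<String, UnaryOperator<String>> handlers = new LinkedHashMap<>();
  // Callbacks invoked once per reconfiguration run, after all handlers ran.
  private final List<Consumer<Set<String>>> completeCallbacks = new ArrayList<>();
  // Last applied value per property.
  private final Map<String, String> current = new HashMap<>();

  ReconfigSketch register(String key, UnaryOperator<String> handler) {
    handlers.put(key, handler);
    return this;
  }

  ReconfigSketch onComplete(Consumer<Set<String>> callback) {
    completeCallbacks.add(callback);
    return this;
  }

  /** Applies new values and returns the set of properties that actually changed. */
  Set<String> reconfigure(Map<String, String> newValues) {
    Set<String> changed = new LinkedHashSet<>();
    for (Map.Entry<String, String> e : newValues.entrySet()) {
      UnaryOperator<String> handler = handlers.get(e.getKey());
      if (handler == null) {
        continue; // property is not registered as reconfigurable
      }
      String applied = handler.apply(e.getValue());
      String previous = current.put(e.getKey(), applied);
      if (!applied.equals(previous)) {
        changed.add(e.getKey());
      }
    }
    if (!changed.isEmpty()) {
      // One notification per run, regardless of how many properties changed.
      for (Consumer<Set<String>> callback : completeCallbacks) {
        callback.accept(Collections.unmodifiableSet(changed));
      }
    }
    return changed;
  }
}
```

With this shape, a service that depends on both an interval and a thread count would register one onComplete callback and restart itself a single time per reconfiguration run.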

What is the link to the Apache JIRA?

HDDS-11513

How was this patch tested?

OM:

bash-5.1$ ozone admin reconfig --service=OM --address=om:9862 status
OM: Reconfiguring status for node [om:9862]: started at Tue Jun 10 07:19:01 UTC 2025 and finished at Tue Jun 10 07:19:01 UTC 2025.
SUCCESS: Changed property ozone.thread.number.dir.deletion
        From: "15"
        To: "19"
SUCCESS: Changed property ozone.directory.deleting.service.interval
        From: "2m"
        To: "3m"

Logs

2025-06-10 07:19:01,307 [Reconfiguration Task] INFO conf.ReconfigurableBase: Starting reconfiguration task.
2025-06-10 07:19:01,314 [Reconfiguration Task] INFO conf.ReconfigurableBase: Change property: ozone.directory.deleting.service.interval from "2m" to "3m".
2025-06-10 07:19:01,315 [Reconfiguration Task] INFO conf.ReconfigurableBase: Change property: ozone.thread.number.dir.deletion from "15" to "19".
2025-06-10 07:19:01,315 [Reconfiguration Task] INFO service.DirectoryDeletingService: Updating and restarting DirectoryDeletingService with interval 180 seconds and core pool size 19
2025-06-10 07:19:01,315 [Reconfiguration Task] INFO utils.BackgroundService: Shutting down service DirectoryDeletingService
2025-06-10 07:19:01,317 [Reconfiguration Task] INFO utils.BackgroundService: Starting service DirectoryDeletingService with interval 180 seconds
2025-06-10 07:19:01,317 [Reconfiguration Task] INFO conf.ReconfigurationHandler: Reconfiguration completed with 2 updated properties.

DN:

bash-5.1$ ozone admin reconfig --service=DATANODE --address=datanode:19864 status
DN: Reconfiguring status for node [datanode:19864]: started at Tue Jun 10 07:19:44 UTC 2025 and finished at Tue Jun 10 07:19:44 UTC 2025.
SUCCESS: Changed property ozone.block.deleting.service.timeout
        From: "300000ms"
        To: "2m"
SUCCESS: Changed property ozone.block.deleting.service.interval
        From: "1m"
        To: "3m"
SUCCESS: Changed property ozone.block.deleting.service.workers
        From: "10"
        To: "30"

Logs

2025-06-10 07:19:44,896 [Reconfiguration Task] INFO conf.ReconfigurableBase: Starting reconfiguration task.
2025-06-10 07:19:44,907 [Reconfiguration Task] INFO conf.ReconfigurableBase: Change property: ozone.block.deleting.service.timeout from "300000ms" to "2m".
2025-06-10 07:19:44,908 [Reconfiguration Task] INFO conf.ReconfigurableBase: Change property: ozone.block.deleting.service.workers from "10" to "30".
2025-06-10 07:19:44,908 [Reconfiguration Task] INFO conf.ReconfigurableBase: Change property: ozone.block.deleting.service.interval from "1m" to "3m".
2025-06-10 07:19:44,908 [Reconfiguration Task] INFO impl.BlockDeletingService: Updating and restarting BlockDeletingService with interval 180 seconds, core pool size 30 and timeout 120000000000 nanoseconds
2025-06-10 07:19:44,908 [Reconfiguration Task] INFO utils.BackgroundService: Shutting down service BlockDeletingService
2025-06-10 07:19:44,908 [Reconfiguration Task] INFO utils.BackgroundService: BlockDeletingService timeout is 120000000000 nanoseconds
2025-06-10 07:19:44,909 [Reconfiguration Task] INFO utils.BackgroundService: Starting service BlockDeletingService with interval 180 seconds
2025-06-10 07:19:44,909 [Reconfiguration Task] INFO conf.ReconfigurationHandler: Reconfiguration completed with 3 updated properties.
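
The values in the logs above are consistent with simple unit conversion: "2m" becomes 120000000000 nanoseconds and "3m" becomes 180 seconds. The following is a simplified sketch of that arithmetic only. DurationSketch is a hypothetical helper written for this illustration, it supports just the ms, s, m and h suffixes, and it is not the parser Ozone or Hadoop actually uses.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; a simplified stand-in for suffix-based duration parsing. */
final class DurationSketch {
  private DurationSketch() { }

  /** Converts values such as "2m" or "300000ms" to nanoseconds. */
  static long toNanos(String value) {
    String v = value.trim();
    int i = 0;
    while (i < v.length() && Character.isDigit(v.charAt(i))) {
      i++;
    }
    if (i == 0) {
      throw new IllegalArgumentException("No numeric part in: " + value);
    }
    long amount = Long.parseLong(v.substring(0, i));
    String suffix = v.substring(i).trim();
    TimeUnit unit;
    switch (suffix) {
      case "ms": unit = TimeUnit.MILLISECONDS; break;
      case "s":  unit = TimeUnit.SECONDS; break;
      case "m":  unit = TimeUnit.MINUTES; break;
      case "h":  unit = TimeUnit.HOURS; break;
      default:
        throw new IllegalArgumentException("Unsupported suffix in: " + value);
    }
    return unit.toNanos(amount);
  }
}
```

For example, DurationSketch.toNanos("2m") yields 120000000000, matching the timeout reported by BlockDeletingService in the log.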

Contributor

@adoroszlai adoroszlai left a comment

Thanks @sarvekshayr for the patch.

Making properties reconfigurable is a good first step, but the services do not pick up the updated configurations. dirDeletingServiceInterval etc. are only used by the tests.

@aryangupta1998
Contributor

@adoroszlai, dirDeletingServiceInterval (ozone.directory.deleting.service.interval) and blockDeletingServiceInterval (ozone.block.deleting.service.interval) are used by the background services to determine the interval between their executions. Besides being used in tests, these intervals configure the background services that delete directories and blocks. The background service uses scheduleWithFixedDelay to schedule its task with the specified interval:
public void start() { exec.scheduleWithFixedDelay(service, 0, interval, unit); }

@adoroszlai
Contributor

BlockDeletingService is created with interval and timeout taken from config:

Duration blockDeletingSvcInterval = dnConf.getBlockDeletionInterval();
long blockDeletingServiceTimeout = config
    .getTimeDuration(OZONE_BLOCK_DELETING_SERVICE_TIMEOUT,
        OZONE_BLOCK_DELETING_SERVICE_TIMEOUT_DEFAULT,
        TimeUnit.MILLISECONDS);
int blockDeletingServiceWorkerSize = config
    .getInt(OZONE_BLOCK_DELETING_SERVICE_WORKERS,
        OZONE_BLOCK_DELETING_SERVICE_WORKERS_DEFAULT);
blockDeletingService =
    new BlockDeletingService(this, blockDeletingSvcInterval.toMillis(),
        blockDeletingServiceTimeout, TimeUnit.MILLISECONDS,
        blockDeletingServiceWorkerSize, config,
        datanodeDetails.threadNamePrefix(),
        context.getParent().getReconfigurationHandler());

They are stored in the base BackgroundService class:

public BackgroundService(String serviceName, long interval,
    TimeUnit unit, int threadPoolSize, long serviceTimeout,
    String threadNamePrefix) {
  this.interval = interval;
  this.unit = unit;
  this.serviceName = serviceName;
  this.serviceTimeoutInNanos = TimeDuration.valueOf(serviceTimeout, unit)
      .toLong(TimeUnit.NANOSECONDS);

which schedules the task when started:

public void start() {
  exec.scheduleWithFixedDelay(service, 0, interval, unit);

Reconfiguring does not change values in BackgroundService (they are final), and even updating interval wouldn't change the task already scheduled with fixed delay (it needs to be rescheduled).
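
To make the rescheduling point concrete, here is a minimal sketch under stated assumptions. ReschedulableService is a hypothetical stand-in and not Ozone's BackgroundService. It keeps the interval in a non-final field and, on update, cancels the ScheduledFuture returned by scheduleWithFixedDelay and schedules the task again. The logs in the PR description suggest the patch shuts the service down and starts it again instead, so treat the cancel-and-reschedule variant below as an illustration of the principle only.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a periodic service whose interval can change at runtime. */
final class ReschedulableService {
  private final ScheduledExecutorService exec =
      Executors.newSingleThreadScheduledExecutor();
  private final Runnable task;
  private long intervalMillis;        // not final, so it can be updated
  private ScheduledFuture<?> future;  // handle of the currently scheduled task

  ReschedulableService(Runnable task, long intervalMillis) {
    this.task = task;
    this.intervalMillis = intervalMillis;
  }

  synchronized void start() {
    future = exec.scheduleWithFixedDelay(
        task, 0, intervalMillis, TimeUnit.MILLISECONDS);
  }

  /** Changing the field alone is not enough; the task must be rescheduled. */
  synchronized void updateInterval(long newIntervalMillis) {
    intervalMillis = newIntervalMillis;
    if (future != null) {
      future.cancel(false); // let a running execution finish, stop future ones
      future = exec.scheduleWithFixedDelay(
          task, newIntervalMillis, newIntervalMillis, TimeUnit.MILLISECONDS);
    }
  }

  synchronized long getIntervalMillis() {
    return intervalMillis;
  }

  synchronized void shutdown() {
    exec.shutdownNow();
  }
}
```

Calling updateInterval before start only records the new value; after start it also replaces the scheduled task so the new delay takes effect.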

@aryangupta1998
Contributor

Thanks for the clarification, @adoroszlai. I checked the code: for these configs to take effect, we need to restart the OM. The "ozone om" command starts the key manager, which then initializes the block deleting and directory deleting services with their intervals. So, simply updating these properties without starting the OM again won't be effective here.

@adoroszlai
Contributor

simply updating these properties without starting the OM again won't be effective here

That's my point. OM can be tweaked to pick up the config without restart, but the current patch is not enough for that yet.

@adoroszlai adoroszlai marked this pull request as draft March 4, 2025 18:06
@sarvekshayr sarvekshayr marked this pull request as ready for review June 6, 2025 11:27
@sarvekshayr
Contributor Author

@sumitagrawl @aryangupta1998 @Tejaskriya Please review this PR.

Contributor

@sumitagrawl sumitagrawl left a comment

@sarvekshayr Thanks for working on this. I have left a few comments.

@sumitagrawl sumitagrawl requested a review from Copilot June 9, 2025 11:40
Contributor

Copilot AI left a comment

Pull Request Overview

Adds dynamic reconfiguration support for deletion-related settings in OM and DataNode so they can be changed at runtime without restarting.

  • OM now supports changing the number of directory-deletion threads (ozone.thread.number.dir.deletion).
  • DataNode supports changing block-deletion interval and timeout (ozone.block.deleting.service.interval and ozone.block.deleting.service.timeout) at runtime.
  • BackgroundService timeout is now mutable, and integration tests are updated to cover the new OM reconfig.
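
As an illustration of what a mutable timeout could look like, here is a minimal sketch. TimeoutSketch, setTimeout and runOnce are hypothetical names, and the way the timeout is enforced (waiting on a Future) is an assumption made for this example, not a description of how BackgroundService handles it. The point is only that a volatile field with a setter lets the next task execution observe a value changed by a reconfiguration thread.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of a service with a runtime-adjustable task timeout. */
final class TimeoutSketch {
  // volatile: written by a reconfiguration thread, read by the caller of runOnce.
  private volatile long timeoutNanos;
  private final ExecutorService worker = Executors.newSingleThreadExecutor();

  TimeoutSketch(long timeout, TimeUnit unit) {
    this.timeoutNanos = unit.toNanos(timeout);
  }

  void setTimeout(long timeout, TimeUnit unit) {
    this.timeoutNanos = unit.toNanos(timeout);
  }

  long getTimeoutNanos() {
    return timeoutNanos;
  }

  /** Runs the task and returns true if it finished within the current timeout. */
  boolean runOnce(Runnable task) {
    Future<?> result = worker.submit(task);
    try {
      result.get(timeoutNanos, TimeUnit.NANOSECONDS);
      return true;
    } catch (TimeoutException e) {
      result.cancel(true); // interrupt the task that ran too long
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      result.cancel(true);
      return false;
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    }
  }

  void shutdown() {
    worker.shutdownNow();
  }
}
```

A reconfiguration callback would then call setTimeout with the new value, and no restart is needed for the timeout alone.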

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File | Description
--- | ---
ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java | Register and implement reconfOzoneThreadNumberDirDeletion callback
integration-test/src/test/java/org/apache/hadoop/ozone/reconfig/TestOmReconfiguration.java | Add test for thread-count reconfiguration in OM
integration-test/src/test/java/org/apache/hadoop/ozone/reconfig/TestDatanodeReconfiguration.java | Include block-deleting interval/timeout keys in expected reconfigurable set
container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/BlockDeletingService.java | Register complete callback and implement updateAndRestart on interval change
container-service/src/main/java/org/apache/hadoop/ozone/HddsDatanodeService.java | Register DN reconfig callbacks for interval and timeout and apply timeout
common/src/main/java/org/apache/hadoop/hdds/utils/BackgroundService.java | Make serviceTimeoutInNanos mutable and add setter
Comments suppressed due to low confidence (3)

hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/HddsDatanodeService.java:292

  • The reconfigBlockDeletingServiceInterval callback only updates the configuration key but does not apply the new interval to the BlockDeletingService at runtime. Consider invoking a service update (e.g., calling updateAndRestart on BlockDeletingService) or setting the interval directly so the change takes effect immediately.
.register(OZONE_BLOCK_DELETING_SERVICE_INTERVAL, this::reconfigBlockDeletingServiceInterval)

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java:5164

  • [nitpick] The method prefix reconf is inconsistent with the reconfig prefix used elsewhere (e.g., in HddsDatanodeService). Consider standardizing on a single naming convention for reconfiguration callbacks for clarity.
private String reconfOzoneThreadNumberDirDeletion(String newVal) {

hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/reconfig/TestDatanodeReconfiguration.java:55

  • There are currently no tests verifying that runtime reconfiguration of block-deleting interval and timeout actually affects the service behavior on the DataNode. Consider adding integration or unit tests to assert that the BlockDeletingService applies new values without restart.
.add(OZONE_BLOCK_DELETING_SERVICE_TIMEOUT)

Contributor

@aryangupta1998 aryangupta1998 left a comment

Thanks for updating the patch @sarvekshayr!

@adoroszlai adoroszlai dismissed their stale review June 16, 2025 07:17

patch updated

Contributor

@aryangupta1998 aryangupta1998 left a comment

Thanks for updating the patch @sarvekshayr, LGTM!

Contributor

@sumitagrawl sumitagrawl left a comment

LGTM

@sumitagrawl sumitagrawl merged commit 9bc53b2 into apache:master Jun 16, 2025
41 checks passed
sadanand48 pushed a commit to sadanand48/hadoop-ozone that referenced this pull request Jun 16, 2025
aswinshakil added a commit to aswinshakil/ozone that referenced this pull request Jun 20, 2025
…239-container-reconciliation

Commits: 62

da53b5b HDDS-13299. Fix failures related to delete (apache#8665)
8c1b439 HDDS-13296. Integration check always passes due to missing output (apache#8662)
7329859 HDDS-13023. Container checksum is missing after container import (apache#8459)
a0af93e HDDS-13292. Change `<? extends KeyValue>` to `<KeyValue>` in test (apache#8657)
f3050cf HDDS-13276. Use KEY_ONLY/VALUE_ONLY iterator in SCM/Datanode. (apache#8638)
e9c0a45 HDDS-13262. Simplify key name validation (apache#8619)
f713e57 HDDS-12482. Avoid using CommonConfigurationKeys (apache#8647)
b574709 HDDS-12924. datanode used space calculation optimization (apache#8365)
de683aa HDDS-13263. Refactor DB Checkpoint Utilities. (apache#8620)
97262aa HDDS-13256. Updated OM Snapshot Grafana Dashboard to reflect metric updates from HDDS-13181. (apache#8639)
9d2b415 HDDS-13234. Expired secret key can abort leader OM startup. (apache#8601)
d9049a2 HDDS-13220. Change Recon 'Negative usedBytes' message loglevel to DEBUG (apache#8648)
6df3077 HDDS-9223. Use protobuf for SnapshotDiffJobCodec (apache#8503)
a7fc290 HDDS-13236. Change Table methods not to throw IOException. (apache#8645)
9958f5b HDDS-13287. Upgrade commons-beanutils to 1.11.0 due to CVE-2025-48734 (apache#8646)
48aefea HDDS-13277. [Docs] Native C/C++ Ozone clients (apache#8630)
052d912 HDDS-13037. Let container create command support STANDALONE , RATIS and EC containers (apache#8559)
90ed60b HDDS-13279. Skip verifying Apache Ranger binaries in CI (apache#8633)
9bc53b2 HDDS-11513. All deletion configurations should be configurable without restart (apache#8003)
ac511ac HDDS-13259. Deletion Progress - Grafana Dashboard (apache#8617)
3370f42 HDDS-13246. Change `<? extend KeyValue>` to `<KeyValue>` in hadoop-hdds (apache#8631)
7af8c44 HDDS-11454. Ranger integration for Docker Compose environment (apache#8575)
5a3e4e7 HDDS-13273. Bump awssdk to 2.31.63 (apache#8626)
77138b8 HDDS-13254. Change table iterator to optionally read key or value. (apache#8621)
ce288b6 HDDS-13265. Simplify the page Access Ozone using HTTPFS REST API (apache#8629)
36fe888 HDDS-13275. Improve CheckNative implementation (apache#8628)
d38484e HDDS-13274. Bump sqlite-jdbc to 3.50.1.0 (apache#8627)
3f3ec43 HDDS-13266. `ozone debug checknative` to show OpenSSL lib (apache#8623)
8983a63 HDDS-13272. Bump junit to 5.13.1 (apache#8625)
a927113 HDDS-13271. [Docs] Minor text updates, reference links. (apache#8624)
7e77058 HDDS-13112. [Docs] OM Bootstrap can also happen when follower falls behind too much. (apache#8600)
fd13300 HDDS-10775. Support bucket ownership verification (apache#8558)
3ecf345 HDDS-13207. [Docs] Third party systems compatible with Ozone S3. (apache#8584)
ad5a507 HDDS-13035. SnapshotDeletingService should hold write locks while purging deleted snapshots (apache#8554)
38a9186 HDDS-12637. Increase max buffer size for tar entry read/write (apache#8618)
f31c264 HDDS-13045. Implement Immediate Triggering of Heartbeat when Volume Full (apache#8590)
0701d6a HDDS-13248. Remove `ozone debug replicas verify` option --output-dir (apache#8612)
ca1afe8 HDDS-13257. Remove separate split for shell integration tests (apache#8616)
5d6fe94 HDDS-13216. Standardize Container[Replica]NotFoundException messages (apache#8599)
1e47217 HDDS-13168. Fix error response format in CheckUploadContentTypeFilter (apache#8614)
6d4d423 HDDS-13181. Added metrics for internal Snapshot Operations. (apache#8606)
4a461b2 HDDS-10490. Intermittent NPE in TestSnapshotDiffManager#testLoadJobsOnStartUp (apache#8596)
bf29f7f HDDS-13235. The equals/hashCode methods in anonymous KeyValue classes may not work. (apache#8607)
6ff3ad6 HDDS-12873. Improve ContainerData statistics synchronization. (apache#8305)
09d3b27 HDDS-13244. TestSnapshotDeletingServiceIntegrationTest should close snapshots after deleting them (apache#8611)
931bc2d HDDS-13243. copy-rename-maven-plugin version is missing (apache#8605)
3b5985c HDDS-13244. Disable TestSnapshotDeletingServiceIntegrationTest
6bf009c HDDS-12927. metrics and log to indicate datanode crossing disk limits (apache#8573)
752da2b HDDS-12760. Intermittent Timeout in testImportedContainerIsClosed (apache#8349)
8c32363 HDDS-13050. Update StartFromDockerHub.md. (apache#8586)
ba1887c HDDS-13241. Fix some potential resource leaks (apache#8602)
bbaf71e HDDS-13130. Rename all instances of Disk Usage to Namespace usage (apache#8571)
0628386 HDDS-13142. Correct SCMPerformanceMetrics for delete operation. (apache#8592)
516bc96 HDDS-13148. [Docs] Update Transparent Data Encryption doc. (apache#8530)
5787135 HDDS-13229. [Doc] Fix incorrect CLI argument order in OM upgrade docs (apache#8598)
ba95074 HDDS-13107. Support limiting output of `ozone admin datanode list` (apache#8595)
e7f5544 HDDS-13171. Replace pipelineID if nodes are changed (apache#8562)
3c9d4d8 HDDS-13103. Correct transaction metrics in SCMBlockDeletingService. (apache#8516)
f62eb8a HDDS-13160. Remove SnapshotDirectoryCleaningService and refactor AbstractDeletingService (apache#8547)
b46e6b2 HDDS-13150. Fixed SnapshotLimitCheck when failures occur. (apache#8532)
203c1d3 HDDS-13206. Update documentation for Apache Ranger (apache#8583)
2072ef0 HDDS-13214. populate-cache fails due to unused dependency (apache#8594)

Conflicts:
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerData.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainer.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/helpers/KeyValueContainerUtil.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/statemachine/background/BlockDeletingTask.java