Contributor

@siddhantsangwan siddhantsangwan commented Jun 9, 2025

What changes were proposed in this pull request?

Trigger a datanode heartbeat immediately when the container being written to is full or close to full, when the volume is full, or when the container is unhealthy. The immediate heartbeat will contain a close container action.

  • Changes made to HddsDispatcher.
  • Per-container throttling. The heartbeat is triggered immediately only once for a given container. After that, the close container action can be sent in a regular heartbeat if needed. This acts as a retry mechanism in case the immediate heartbeat is lost, is not processed, or fails for any other runtime reason.
  • The volume full check is changed to reserved - available - min free <= 0. This fixes a bug.
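
The per-container throttling described above can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: `CloseActionThrottleSketch`, `ContainerState`, `Dispatcher`, and all method names are hypothetical. The only detail taken from this page is that an AtomicBoolean flag is used to remember whether the immediate trigger has already happened for a container.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative only: every name here is hypothetical, not Ozone's real API. */
final class CloseActionThrottleSketch {

  /** Stand-in for per-container state that remembers the one-time trigger. */
  static final class ContainerState {
    private final long containerId;
    private final AtomicBoolean immediateCloseActionSent = new AtomicBoolean(false);

    ContainerState(long containerId) {
      this.containerId = containerId;
    }

    long getContainerId() {
      return containerId;
    }

    /** Returns true for exactly one caller per container, even with concurrent callers. */
    boolean markImmediateCloseActionSent() {
      return immediateCloseActionSent.compareAndSet(false, true);
    }
  }

  /** Stand-in for the dispatcher side; single-threaded for the sake of the demo. */
  static final class Dispatcher {
    private final List<Long> queuedCloseActions = new ArrayList<>();
    private int immediateHeartbeats;

    void onContainerNeedsClose(ContainerState container) {
      // Always queue the action so that a regular heartbeat can carry it (retry path).
      queuedCloseActions.add(container.getContainerId());
      // Trigger an out-of-schedule heartbeat only the first time for this container.
      if (container.markImmediateCloseActionSent()) {
        immediateHeartbeats++;
      }
    }

    int getImmediateHeartbeats() {
      return immediateHeartbeats;
    }

    List<Long> getQueuedCloseActions() {
      return queuedCloseActions;
    }
  }

  public static void main(String[] args) {
    Dispatcher dispatcher = new Dispatcher();
    ContainerState c1 = new ContainerState(1L);
    ContainerState c2 = new ContainerState(2L);
    dispatcher.onContainerNeedsClose(c1);
    dispatcher.onContainerNeedsClose(c1); // second write to the same full container
    dispatcher.onContainerNeedsClose(c2); // a different container gets its own trigger
    System.out.println("immediate heartbeats: " + dispatcher.getImmediateHeartbeats());
    System.out.println("queued close actions: " + dispatcher.getQueuedCloseActions().size());
  }
}
```

Using `compareAndSet(false, true)` rather than a separate read and write means that two threads handling writes to the same container cannot both win the one-time trigger.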

See #8492 (comment) for full context and #8460 for a design discussion. The design discussion isn't fully updated yet, but both the above pull requests contain discussions about alternatives that we didn't decide to take.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13045

How was this patch tested?

Modified existing unit tests.

The PR is ready for review but marked as draft while CI is running in my fork.

I also did some manual testing using the docker compose ozone definition and some temporary changes to simulate a full volume.

Started with 150 MB available space and 100 MB min free space:

2025-06-09 07:27:33,660 [main] INFO volume.HddsVolume: HddsVolume: { id=DS-6f8612f1-b7b1-4914-854d-bdb1b25114ad dir=/data/hdds/hdds type=DISK capacity=2147268899 used=1990197248 available=157071651 minFree=104857600 committed=0 }

Wrote 150 MB of data. The datanode detected the full volume/container and triggered a heartbeat:

 2025-06-09 07:28:07,530 [4cbef4c3-3e47-436e-809f-9282b2b1b1f9-EndpointStateMachineTaskThread-scm/172.18.0.6:9861-0 ] INFO endpoint.HeartbeatEndpointTask: Sending heartbeat message : containerActions {
   containerID: 1
   action: CLOSE
   reason: CONTAINER_FULL
 }
 
 2025-06-09 07:28:07,531 [4cbef4c3-3e47-436e-809f-9282b2b1b1f9-EndpointStateMachineTaskThread-recon/172.18.0.5:9891-0 ] INFO endpoint.HeartbeatEndpointTask: Sending heartbeat message : containerActions {
   containerID: 1
   action: CLOSE
   reason: CONTAINER_FULL
 }
 
 2025-06-09 07:28:07,882 [4cbef4c3-3e47-436e-809f-9282b2b1b1f9-ContainerOp-99fd4123-d677-4ed0-b872-d86353836135-4] WARN impl.HddsDispatcher: Volume [/data/hdds/hdds] is full. containerID: 1. Volume usage: [capacity=2147268899, used=2042626048, available=104642851].
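
The numbers in these log lines can be checked by hand. The sketch below assumes that "full" means available space minus min free space is at or below zero, with committed space left out of the calculation (as the method name isVolumeFullExcludingCommittedSpace mentioned in the review suggests). The class and method names are hypothetical, and the exact formula used in HddsVolume may differ.

```java
/** Illustrative arithmetic only; names are hypothetical and the formula is an assumption. */
final class VolumeFullCheckSketch {

  /** Space left for writes once the min free reservation is honoured. */
  static long usableSpace(long available, long minFree) {
    return available - minFree;
  }

  /** Assumed fullness rule: nothing usable remains. */
  static boolean isFull(long available, long minFree) {
    return usableSpace(available, minFree) <= 0;
  }

  public static void main(String[] args) {
    long minFree = 104_857_600L;          // minFree from the startup log line
    long availableAtStart = 157_071_651L; // available from the startup log line
    long availableAtWarn = 104_642_851L;  // available from the "is full" WARN line

    System.out.println(usableSpace(availableAtStart, minFree)); // 52214051
    System.out.println(isFull(availableAtStart, minFree));      // false
    System.out.println(usableSpace(availableAtWarn, minFree));  // -214749
    System.out.println(isFull(availableAtWarn, minFree));       // true
  }
}
```

Under this assumption the volume had about 52 MB of usable space at startup and was about 0.2 MB past the min free threshold when the warning was logged.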

SCM received the heartbeat at the same time:

2025-06-09 07:28:07,534 [IPC Server handler 59 on default port 9861] INFO server.SCMDatanodeHeartbeatDispatcher: Dispatching Container Actions. containerActions {
   containerID: 1
   action: CLOSE
   reason: CONTAINER_FULL
 }
 
 2025-06-09 07:28:07,535 [IPC Server handler 29 on default port 9861] INFO server.SCMDatanodeHeartbeatDispatcher: Dispatching Container Actions. containerActions {
   containerID: 1
   action: CLOSE
   reason: CONTAINER_FULL
 }
 
 2025-06-09 07:28:07,535 [EventQueue-CloseContainerForCloseContainerEventHandler] INFO container.CloseContainerEventHandler: Close container Event triggered for container : #1, current state: OPEN
 2025-06-09 07:28:07,535 [IPC Server handler 43 on default port 9861] INFO server.SCMDatanodeHeartbeatDispatcher: Dispatching Container Actions. containerActions {
   containerID: 1
   action: CLOSE
   reason: CONTAINER_FULL
 }

Contributor

@sumitagrawl sumitagrawl left a comment

@siddhantsangwan I have minor comments; otherwise LGTM.

@sumitagrawl sumitagrawl requested a review from Copilot June 9, 2025 10:09
Contributor

Copilot AI left a comment

Pull Request Overview

This PR addresses the immediate triggering of a datanode heartbeat when a container or volume is full by modifying the close container action logic and updating tests to verify the per-container throttling behavior. Key changes include:

  • Adding immediate heartbeat logic using an AtomicBoolean flag in ContainerData and HddsDispatcher.
  • Updating the volume full check in HddsVolume.
  • Enhancing unit tests in TestHddsDispatcher to verify heartbeat triggering and throttling.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/impl/TestHddsDispatcher.java | Updated test scenarios to verify immediate heartbeat triggering and throttling behavior. |
| hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/HddsVolume.java | Updated volume full check logic. |
| hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java | Introduced immediate heartbeat triggering logic and replaced volume full checking with an adjusted method. |
| hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerData.java | Added an AtomicBoolean to track immediate close action triggering. |

Comments suppressed due to low confidence (2)

hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/impl/TestHddsDispatcher.java:312

  • The increment value has been changed from 50 to 60 bytes; please update the accompanying comment to explain why the increment was adjusted so that the test scenario accurately reflects the volume full condition.
containerData.getVolume().getVolumeUsage().ifPresent(usage -> usage.incrementUsedSpace(60));

hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java:629

  • [nitpick] The method name is quite verbose; consider using a more concise name (e.g., isVolumeFullAdjusted) that still conveys that committed space is excluded from the check.
private boolean isVolumeFullExcludingCommittedSpace(Container container) {

@siddhantsangwan
Contributor Author

In the latest commit I added verification in the unit tests that heartbeat throttling is done per container.

Contributor

@sumitagrawl sumitagrawl left a comment

LGTM

        ContainerAction.Reason.CONTAINER_UNHEALTHY;
    ContainerAction action = ContainerAction.newBuilder()
        .setContainerID(containerData.getContainerID())
        .setAction(ContainerAction.Action.CLOSE).setReason(reason).build();
Contributor

If we are closing the container because of full volume, we can set the reason to VOLUME_FULL.

Contributor Author

@nandakumar131 Thanks for the review! Since VOLUME_FULL isn't strictly required, in the interest of time let's merge this PR? I'd like to avoid making a proto change in this one. We can make the improvement in a future PR if needed.

Contributor

Yeah, we can do that as a followup.
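
For reference, the follow-up discussed in this thread could look roughly like the sketch below. It is purely hypothetical: the `Reason` enum here is a local stand-in, not the protobuf-generated ContainerAction.Reason, and VOLUME_FULL does not exist in the proto as of this pull request. The order in which the conditions are checked is also an assumption.

```java
/** Hypothetical sketch of reason selection; not the real ContainerAction.Reason. */
final class CloseReasonSketch {

  /** Local stand-in enum; VOLUME_FULL would require a proto change in practice. */
  enum Reason { CONTAINER_FULL, VOLUME_FULL, CONTAINER_UNHEALTHY }

  /**
   * Picks a reason for a container that the caller has already decided to close.
   * If neither fullness condition holds, the container is assumed to be unhealthy.
   */
  static Reason chooseReason(boolean containerFull, boolean volumeFull) {
    if (containerFull) {
      return Reason.CONTAINER_FULL;
    }
    if (volumeFull) {
      return Reason.VOLUME_FULL;
    }
    return Reason.CONTAINER_UNHEALTHY;
  }

  public static void main(String[] args) {
    System.out.println(chooseReason(true, false));
    System.out.println(chooseReason(false, true));
    System.out.println(chooseReason(false, false));
  }
}
```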

Contributor

@adoroszlai adoroszlai left a comment

Thanks @siddhantsangwan for the patch.

Please consider using StorageLocationReport to capture all volume information in one step, and use it for both calculation and logging for consistency.

Example:

StorageLocationReport volumeReport = volume.getReport();
boolean full = volumeReport.getUsableSpace() <= 0;
if (full) {
  LOG.info("Container {} volume is full: {}", container.getContainerData().getContainerID(), volumeReport);
}
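
The reasoning behind this suggestion can be illustrated with a self-contained sketch. `VolumeSnapshot` below is a hypothetical stand-in for StorageLocationReport, and its usable-space formula (available minus min free) is an assumption. The point it shows is that the decision and the log message are derived from one immutable set of values, so the logged numbers cannot disagree with the ones that drove the decision.

```java
import java.util.Optional;

/** Illustrative only: VolumeSnapshot is a hypothetical stand-in for StorageLocationReport. */
final class VolumeSnapshotSketch {

  /** Immutable view of volume usage captured at a single point in time. */
  static final class VolumeSnapshot {
    private final long capacity;
    private final long used;
    private final long available;
    private final long minFree;

    VolumeSnapshot(long capacity, long used, long available, long minFree) {
      this.capacity = capacity;
      this.used = used;
      this.available = available;
      this.minFree = minFree;
    }

    /** Assumed definition of usable space; the real report may compute it differently. */
    long getUsableSpace() {
      return available - minFree;
    }

    @Override
    public String toString() {
      return String.format("capacity=%d, used=%d, available=%d, minFree=%d",
          capacity, used, available, minFree);
    }
  }

  /** Decides and formats from the same snapshot; empty when the volume is not full. */
  static Optional<String> describeIfFull(long containerId, VolumeSnapshot snapshot) {
    if (snapshot.getUsableSpace() <= 0) {
      return Optional.of(
          String.format("Container %d volume is full: %s", containerId, snapshot));
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    // Values taken from the WARN log line in the manual test above.
    VolumeSnapshot snapshot =
        new VolumeSnapshot(2_147_268_899L, 2_042_626_048L, 104_642_851L, 104_857_600L);
    describeIfFull(1L, snapshot).ifPresent(System.out::println);
  }
}
```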

# Conflicts:
#	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerData.java
#	hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/impl/TestHddsDispatcher.java
@adoroszlai adoroszlai merged commit f31c264 into apache:master Jun 13, 2025
41 checks passed
@adoroszlai
Contributor

Thanks @siddhantsangwan for the patch, @nandakumar131, @sumitagrawl for the review.

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jun 13, 2025
sadanand48 pushed a commit to sadanand48/hadoop-ozone that referenced this pull request Jun 16, 2025
aswinshakil added a commit to aswinshakil/ozone that referenced this pull request Jun 20, 2025
…239-container-reconciliation

Commits: 62

da53b5b HDDS-13299. Fix failures related to delete (apache#8665)
8c1b439 HDDS-13296. Integration check always passes due to missing output (apache#8662)
7329859 HDDS-13023. Container checksum is missing after container import (apache#8459)
a0af93e HDDS-13292. Change `<? extends KeyValue>` to `<KeyValue>` in test (apache#8657)
f3050cf HDDS-13276. Use KEY_ONLY/VALUE_ONLY iterator in SCM/Datanode. (apache#8638)
e9c0a45 HDDS-13262. Simplify key name validation (apache#8619)
f713e57 HDDS-12482. Avoid using CommonConfigurationKeys (apache#8647)
b574709 HDDS-12924. datanode used space calculation optimization (apache#8365)
de683aa HDDS-13263. Refactor DB Checkpoint Utilities. (apache#8620)
97262aa HDDS-13256. Updated OM Snapshot Grafana Dashboard to reflect metric updates from HDDS-13181. (apache#8639)
9d2b415 HDDS-13234. Expired secret key can abort leader OM startup. (apache#8601)
d9049a2 HDDS-13220. Change Recon 'Negative usedBytes' message loglevel to DEBUG (apache#8648)
6df3077 HDDS-9223. Use protobuf for SnapshotDiffJobCodec (apache#8503)
a7fc290 HDDS-13236. Change Table methods not to throw IOException. (apache#8645)
9958f5b HDDS-13287. Upgrade commons-beanutils to 1.11.0 due to CVE-2025-48734 (apache#8646)
48aefea HDDS-13277. [Docs] Native C/C++ Ozone clients (apache#8630)
052d912 HDDS-13037. Let container create command support STANDALONE , RATIS and EC containers (apache#8559)
90ed60b HDDS-13279. Skip verifying Apache Ranger binaries in CI (apache#8633)
9bc53b2 HDDS-11513. All deletion configurations should be configurable without restart (apache#8003)
ac511ac HDDS-13259. Deletion Progress - Grafana Dashboard (apache#8617)
3370f42 HDDS-13246. Change `<? extend KeyValue>` to `<KeyValue>` in hadoop-hdds (apache#8631)
7af8c44 HDDS-11454. Ranger integration for Docker Compose environment (apache#8575)
5a3e4e7 HDDS-13273. Bump awssdk to 2.31.63 (apache#8626)
77138b8 HDDS-13254. Change table iterator to optionally read key or value. (apache#8621)
ce288b6 HDDS-13265. Simplify the page Access Ozone using HTTPFS REST API (apache#8629)
36fe888 HDDS-13275. Improve CheckNative implementation (apache#8628)
d38484e HDDS-13274. Bump sqlite-jdbc to 3.50.1.0 (apache#8627)
3f3ec43 HDDS-13266. `ozone debug checknative` to show OpenSSL lib (apache#8623)
8983a63 HDDS-13272. Bump junit to 5.13.1 (apache#8625)
a927113 HDDS-13271. [Docs] Minor text updates, reference links. (apache#8624)
7e77058 HDDS-13112. [Docs] OM Bootstrap can also happen when follower falls behind too much. (apache#8600)
fd13300 HDDS-10775. Support bucket ownership verification (apache#8558)
3ecf345 HDDS-13207. [Docs] Third party systems compatible with Ozone S3. (apache#8584)
ad5a507 HDDS-13035. SnapshotDeletingService should hold write locks while purging deleted snapshots (apache#8554)
38a9186 HDDS-12637. Increase max buffer size for tar entry read/write (apache#8618)
f31c264 HDDS-13045. Implement Immediate Triggering of Heartbeat when Volume Full (apache#8590)
0701d6a HDDS-13248. Remove `ozone debug replicas verify` option --output-dir (apache#8612)
ca1afe8 HDDS-13257. Remove separate split for shell integration tests (apache#8616)
5d6fe94 HDDS-13216. Standardize Container[Replica]NotFoundException messages (apache#8599)
1e47217 HDDS-13168. Fix error response format in CheckUploadContentTypeFilter (apache#8614)
6d4d423 HDDS-13181. Added metrics for internal Snapshot Operations. (apache#8606)
4a461b2 HDDS-10490. Intermittent NPE in TestSnapshotDiffManager#testLoadJobsOnStartUp (apache#8596)
bf29f7f HDDS-13235. The equals/hashCode methods in anonymous KeyValue classes may not work. (apache#8607)
6ff3ad6 HDDS-12873. Improve ContainerData statistics synchronization. (apache#8305)
09d3b27 HDDS-13244. TestSnapshotDeletingServiceIntegrationTest should close snapshots after deleting them (apache#8611)
931bc2d HDDS-13243. copy-rename-maven-plugin version is missing (apache#8605)
3b5985c HDDS-13244. Disable TestSnapshotDeletingServiceIntegrationTest
6bf009c HDDS-12927. metrics and log to indicate datanode crossing disk limits (apache#8573)
752da2b HDDS-12760. Intermittent Timeout in testImportedContainerIsClosed (apache#8349)
8c32363 HDDS-13050. Update StartFromDockerHub.md. (apache#8586)
ba1887c HDDS-13241. Fix some potential resource leaks (apache#8602)
bbaf71e HDDS-13130. Rename all instances of Disk Usage to Namespace usage (apache#8571)
0628386 HDDS-13142. Correct SCMPerformanceMetrics for delete operation. (apache#8592)
516bc96 HDDS-13148. [Docs] Update Transparent Data Encryption doc. (apache#8530)
5787135 HDDS-13229. [Doc] Fix incorrect CLI argument order in OM upgrade docs (apache#8598)
ba95074 HDDS-13107. Support limiting output of `ozone admin datanode list` (apache#8595)
e7f5544 HDDS-13171. Replace pipelineID if nodes are changed (apache#8562)
3c9d4d8 HDDS-13103. Correct transaction metrics in SCMBlockDeletingService. (apache#8516)
f62eb8a HDDS-13160. Remove SnapshotDirectoryCleaningService and refactor AbstractDeletingService (apache#8547)
b46e6b2 HDDS-13150. Fixed SnapshotLimitCheck when failures occur. (apache#8532)
203c1d3 HDDS-13206. Update documentation for Apache Ranger (apache#8583)
2072ef0 HDDS-13214. populate-cache fails due to unused dependency (apache#8594)

Conflicts:
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerData.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainer.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/helpers/KeyValueContainerUtil.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/statemachine/background/BlockDeletingTask.java
swamirishi pushed a commit to swamirishi/ozone that referenced this pull request Dec 3, 2025
…hen Volume Full (apache#8590)

(cherry picked from commit f31c264)

Conflicts:
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerData.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java
	hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/impl/TestHddsDispatcher.java

1. Added the code line `String usage = volume.getVolumeInfo().map(info -> info.getCurrentUsage().toString()).orElse("none");` to HddsDispatcher#sendCloseContainerActionIfNeeded to get the `SpaceUsageSource` object.
2. Similarly, added `getVolumeInfo().isPresent()` to HddsVolume#isVolumeFull.