
Conversation

@siddhantsangwan (Contributor):

What changes were proposed in this pull request?

This is a design discussion for HDDS-12929. Please see the changes for all the details.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-12929

How was this patch tested?

Rendered the markdown in IntelliJ; it looked fine.

@siddhantsangwan (Contributor, Author):

@peterxcli you may also be interested in reviewing this. For some reason your username doesn't appear in the "Request reviewers" UI.

Comment on lines 43 to 45
In HddsDispatcher, on detecting that the volume being written to is close to full, we add a CloseContainerAction for
that container. This is sent to the SCM in the next heartbeat and makes the SCM close that container. This reaction time
is OK for a container that is close to full, but not if the volume is close to full.
Contributor:

How does this account for the 5GB of reserved/committed space for open containers? I believe that will be counted against the volume's capacity as well. We need to break down what measures are being used to consider a "full" volume.

Contributor:

We also have the ability to immediately trigger a heartbeat at any time. We do this for volume failure already. Seems the issue could be resolved by leaving the CloseContainerAction handling as is and just calling this method when the volume is getting full. This leaves the volume to container mapping inside the datanode without needing to add it to SCM.

Contributor (Author):

> How does this account for the 5GB of reserved/committed space for open containers? I believe that will be counted against the volume's capacity as well. We need to break down what measures are being used to consider a "full" volume.

Yes, and we will use the existing method in the current code to determine whether a volume is full. It accounts for committed space, min free space and reserved space.

  private boolean isVolumeFull(Container container) {
    // Only open containers can still receive writes, so the check only applies to them.
    boolean isOpen = Optional.ofNullable(container)
        .map(cont -> cont.getContainerState() == ContainerDataProto.State.OPEN)
        .orElse(Boolean.FALSE);
    if (isOpen) {
      HddsVolume volume = container.getContainerData().getVolume();
      StorageLocationReport volumeReport = volume.getReport();
      // Usable space already accounts for committed, min free and reserved space,
      // so <= 0 means the volume is effectively full.
      boolean full = volumeReport.getUsableSpace() <= 0;
      if (full) {
        LOG.info("Container {} volume is full: {}", container.getContainerData().getContainerID(), volumeReport);
      }
      return full;
    }
    return false;
  }

Contributor (Author):

> We also have the ability to immediately trigger a heartbeat at any time. We do this for volume failure already. Seems the issue could be resolved by leaving the CloseContainerAction handling as is and just calling this method when the volume is getting full.

Yes, that's what I'm planning to do. However, before sending the heartbeat we need to generate the latest storage report and add it to the heartbeat. Also, the CloseContainerAction is currently only added for that one container. We either need to add an action for all containers on that volume, or send a list of all container IDs on that volume in the heartbeat.

Actually, one idea I've got from this conversation is that adding an action for each container might be easier, as we already have a framework in place in the SCM for handling these actions. Let me think a bit more about this. An immediate hurdle is that the number of actions that can be sent is also throttled as of now.
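
For illustration, a minimal sketch of that per-container approach, with hypothetical hooks standing in for the real StateContext/heartbeat machinery (none of these names are the actual Ozone API):

```java
import java.util.List;

/**
 * Illustrative sketch only: queue a close action for every open container
 * on a full volume, then trigger an out-of-band heartbeat. The interface
 * below is a stand-in for the real datanode hooks, not the actual Ozone API.
 */
public class FullVolumeCloseActions {

  /** Hypothetical hooks; names are not the real Ozone methods. */
  public interface DatanodeHooks {
    List<Long> openContainersOn(String volumeId);   // IDs of open containers on the volume
    void queueCloseAction(long containerId);        // e.g. a CloseContainerAction for the SCM
    void triggerHeartbeat();                        // immediate, out-of-cycle heartbeat
  }

  private final DatanodeHooks hooks;

  public FullVolumeCloseActions(DatanodeHooks hooks) {
    this.hooks = hooks;
  }

  /** Called from the write path when the volume is detected as full. */
  public void onFullVolume(String volumeId) {
    for (long containerId : hooks.openContainersOn(volumeId)) {
      hooks.queueCloseAction(containerId);
    }
    hooks.triggerHeartbeat();
  }
}
```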

is OK for a container that is close to full, but not if the volume is close to full.

### Proposal
This is the proposal, explained via a diagram.
Contributor:

I don't think the diagram adequately explains the proposal. When talking about proposed proto updates, writing out code examples is helpful. The throttling implementation needs to be specified.

Contributor (Author):

I had thought of the throttling implementation and even tried it out in code while thinking of the design, but I didn't specify it in the doc. I've added it to the design now. There's also a pull request that implements part of this design, including throttling - #8492.

Please bear with me as I try to balance too much vs. too little content in my designs!

> When talking about proposed proto updates, writing out code examples is helpful.

Still thinking about what information we can send over the wire. I'll add it once I've decided. There are a couple of high-level ideas to prevent over-allocating blocks to a container/volume in the SCM:

  1. Track how much allocation is done at the SCM, similar to used space and committed space at the DN. Proposed by @sumitagrawl. (A rough sketch of this idea follows after this list.)
  2. Send reports of open containers every 30 seconds to SCM, which @ChenSammi proposed. Also handle these reports so that any containers with size >= max size are closed.
  3. Decide which containers should be closed in the DN, and send CloseContainerAction for all these containers in the heartbeat (which we briefly discussed in a previous comment).

We may need to do some of these, or a mix of all of these. As I think it through, I'll add more info.
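
For illustration only, a rough sketch of idea 1 (hypothetical names and structure; as the meeting summary further down explains, this approach was ultimately not pursued):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch of idea 1: SCM tracks how many bytes it has handed
 * out per container and stops allocating once the tracked total reaches
 * the container's max size. The known flaw: SCM cannot know how much of
 * each 256MB block the client actually writes, so the tracked total
 * over-estimates real usage.
 */
public class ScmAllocationTracker {

  private final long maxContainerSizeBytes;
  private final Map<Long, Long> allocatedBytes = new HashMap<>();

  public ScmAllocationTracker(long maxContainerSizeBytes) {
    this.maxContainerSizeBytes = maxContainerSizeBytes;
  }

  /** Returns true if a block of blockSize bytes may still go to the container. */
  public synchronized boolean tryAllocate(long containerId, long blockSize) {
    long current = allocatedBytes.getOrDefault(containerId, 0L);
    if (current + blockSize > maxContainerSizeBytes) {
      return false;   // treat the container as full; pick or open another one
    }
    allocatedBytes.put(containerId, current + blockSize);
    return true;
  }
}
```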

## Problem
When a Datanode volume is close to full, the SCM may not be immediately aware because storage reports are only sent
to it every thirty seconds. This can lead to the SCM allocating multiple blocks to containers on a full DN volume,
causing performance issues when the write fails. The proposal will partly solve this problem.
Contributor:

SCM checks the container size before allocating a block for that container. Currently the container size is only reported in the full container report, or when the container state changes from open to another state, so SCM is basically allocating blocks blindly. With these 30-second storage reports, I think we should consider reporting the open containers too, to help SCM better understand the open containers' state and avoid over-allocating blocks to a single container. @siddhantsangwan, what do you think?

Contributor (Author):

So you mean include a full container report for all containers in the DN, not just the ones on the full volume? We can use the method StateContext#getFullContainerReportDiscardPendingICR.

Contributor (@ChenSammi, May 19, 2025):

An open containers report, not a full container report. Only an open container will grow in size; other containers' sizes will only shrink, if they change at all. A timely open container size update will help SCM allocate blocks more precisely.

Contributor:

@siddhantsangwan
When a container is 90% full, we add a CloseContainerAction to be sent with the next heartbeat. This could be sent immediately using the immediate-heartbeat mechanism that is already present, with no need to wait for the next heartbeat. I think this may already happen when the ICR for it is sent; this can be verified.

@ChenSammi Since the DN already decides to stop block allocation when a container is 90% full and reports it, sending the open container list to SCM may not provide any further benefit, as the action has already been taken by the DN.

We need to see whether sending OpenContainer information provides any additional benefit. Sending open containers on every heartbeat can be tracked in a separate JIRA and implemented based on the benefits.


![full-volume-handling.png](../../static/full-volume-handling.png)

Throttling is required so the Datanode doesn't cause a heartbeat storm on detecting that some volumes are full in multiple write calls.
Contributor (@ChenSammi, May 19, 2025):

These are the activities that use DN volume space:
a. creating a new container (5GB reserved)
b. writing a new chunk
c. downloading and importing a container (10GB reserved)
d. container metadata RocksDB (no reservation)

IIRC, when SCM allocates a new pipeline, it checks whether the DN has enough space to hold the pipeline metainfo (Raft, 1GB) and one container (5GB). A volume-full report can help SCM become aware of this quickly. Maybe a full storage report, instead of a single volume-full report.

As for the proposal to carry the list of containers on this volume in the disk-full report: because open containers have already reserved space on the volume (the same goes for container replication imports), these open containers may still have room for new blocks even though the volume is full, as long as the total container size doesn't exceed 5GB. So closing all open containers on the full volume immediately might not be a necessary step, but closing open containers whose size exceeds 5GB is something we can do.
Also, when a disk is full, the DN is responsible for not allocating new containers on that volume and for not picking it as the target volume for container imports.

So overall my suggestion is:
a. carry open container state in the periodic storage report
b. when one disk is full, send a full storage report immediately, with open container state, to SCM out of cycle
c. make sure these kinds of reports are handled with priority in SCM. We may consider introducing a new port in SCM just for DN heartbeats with storage reports; currently all reports are sent to a single port.

Contributor (Author):

> IIRC, when SCM allocates a new pipeline, it checks whether the DN has enough space to hold the pipeline metainfo (Raft, 1GB) and one container (5GB). A volume-full report can help SCM become aware of this quickly. Maybe a full storage report, instead of a single volume-full report.

Agreed, I'm planning to send a full storage report containing info about all the volumes.

> As for the proposal to carry the list of containers on this volume in the disk-full report: because open containers have already reserved space on the volume (the same goes for container replication imports), these open containers may still have room for new blocks even though the volume is full, as long as the total container size doesn't exceed 5GB. So closing all open containers on the full volume immediately might not be a necessary step, but closing open containers whose size exceeds 5GB is something we can do.

Good point. I'm planning to use HddsDispatcher#isVolumeFull for checking whether a volume is full. This method ultimately checks whether `available - committed - min free space <= 0`. So if this method returns true, we only need to close containers whose size >= max size (5 GB); not all containers on this volume need to be closed.
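
As a minimal sketch of that check (parameter names are illustrative, not the actual HddsVolume/StorageLocationReport accessors):

```java
// Illustrative only: mirrors the "available - committed - min free <= 0"
// check described above; parameter names are hypothetical.
static boolean isVolumeFull(long availableBytes, long committedBytes,
    long minFreeSpaceBytes) {
  long usable = availableBytes - committedBytes - minFreeSpaceBytes;
  return usable <= 0;
}
```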

Throttling is required so the Datanode doesn't cause a heartbeat storm on detecting that some volumes are full in multiple write calls.

## Benefits
1. SCM will not include a Datanode in a new pipeline if all the volumes on it are full. The logic to do this already exists; we just update the volume stats in the SCM faster.
Contributor:

Does the SCM logic to allocate pipelines take the min free space setting into account? Say a volume reached 95GB of 100GB disk space, triggered a close for that container, sent a volume report, and SCM updated the volume stats for the node. Does SCM ensure that it does not allocate containers at 95GB usage, or does it wait till 100GB? I believe SCM should be aware of / have a similar or more liberal config.

Contributor (Author):

Yes, it takes min free space into account.

Comment on lines 101 to 105
A: Last, regular heartbeat
B: Volume 1 detected as full, heartbeat triggered
C: Volume 1 again detected as full, heartbeat throttled
D: Volume 2 detected as full, heartbeat throttled
E: Volume 3 detected as full, heartbeat triggered (30 seconds after B)
Member:

If vol 1 is full and has already been reported, but then vol 2 becomes full shortly after, the current global throttling would delay the notification about vol 2 to the SCM until the throttle interval passes. During this period, containers on vol 2 would continue to receive data from the SCM, which could lead to more write failures. I suggest throttling heartbeats independently for each volume, so that a full volume can be reported to the SCM as soon as it is detected, regardless of the status of other volumes.
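
A minimal sketch of that per-volume throttle (illustrative class and method names, not the actual Ozone code): each volume keeps its own last-trigger timestamp, so a newly full volume can be reported immediately even if another volume was reported moments earlier.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Illustrative per-volume throttle: a full-volume heartbeat may be
 * triggered for a volume at most once per interval, independently of
 * other volumes.
 */
public class PerVolumeHeartbeatThrottle {

  private final long minIntervalMillis;
  private final ConcurrentMap<String, Long> lastTriggerByVolume =
      new ConcurrentHashMap<>();

  public PerVolumeHeartbeatThrottle(long minIntervalMillis) {
    this.minIntervalMillis = minIntervalMillis;
  }

  /** Returns true if the caller should trigger a heartbeat for this volume now. */
  public boolean tryAcquire(String volumeId, long nowMillis) {
    Long last = lastTriggerByVolume.putIfAbsent(volumeId, nowMillis);
    if (last == null) {
      return true;   // first full-volume report for this volume
    }
    return nowMillis - last >= minIntervalMillis
        && lastTriggerByVolume.replace(volumeId, last, nowMillis);
  }
}
```

With this, volume 2 becoming full shortly after volume 1 would still trigger an immediate report for volume 2, which is exactly the case the global throttle delays.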

Contributor (Author):

I mostly agree with your suggestion. The only disadvantage I can think of is that the DN may end up sending multiple node reports at the same time in the worst case (around 20 reports, considering 20 volumes in the DN), where each report anyway contains data about all the volumes.

Member:

We could make sure that only one ‘volume-full’ heartbeat is sent at any given time—if the feature is supported, which may require refactoring DatanodeStateMachine.

Contributor (Author):

When triggering an immediate heartbeat, we're sending storage reports for all volumes. To keep complexity low, I think that's good enough for now. Let's see how it works in prod, and in a new JIRA we can do per-volume throttling if needed. What do you think?

Also tagging @ChenSammi who was discussing this here.

Member:

I'm fine with this~

Contributor (Author):

Cool!

Member:

Do you want to open a ticket for this?

@siddhantsangwan (Contributor, Author):

I discussed this design with several people in a meeting. Here are some of the important points:

  1. Trying to prevent over-allocation of blocks to a container is complicated. We could track how much space we've allocated to a container in the SCM - this is doable on the surface but won't actually work well. That's because SCM is asked for a block (256MB), but SCM doesn't know how much data a client will actually write to that block file. The client may only write 1MB, for example. So SCM could track that it has already allocated 5 GB to a container, and open another container for incoming requests, but the client may actually only write 1GB. This would lead to a lot of open containers when we have 10k requests/second. (Brought up by @sumitagrawl and @ChenSammi.)
  2. Sending open container reports regularly (every 30 seconds for example) can help a little bit, but won't solve the problem. We won't take this approach for now. (discussed with @ChenSammi and @nandakumar131).
  3. A problem that is out of scope for this design, but was discussed, is that Replication Manager + Placement Policy should not keep selecting a close-to-full Datanode as the target for a container. https://issues.apache.org/jira/browse/HDDS-12998, which is related to this, was also briefly mentioned. This point should be analysed in a different JIRA. (Brought up by @ashishkumar50.)

In summary, the only change this design proposes is to immediately trigger a heartbeat containing storage reports of all volumes when the isVolumeFull check returns true. This will promptly update the volumes' storage stats in the SCM. The SCM can take these into account when creating new pipelines - there's already logic in the SCM for that.
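
A minimal sketch of this flow, assuming a hypothetical Heartbeat hook (the real trigger lives in the datanode state machine and is not shown here): detect a full volume on the write path, throttle, and send an out-of-cycle heartbeat carrying storage reports for all volumes.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Illustrative sketch only; the Heartbeat interface is a stand-in for the
 * real datanode heartbeat hook, not the actual Ozone API.
 */
public class FullVolumeHeartbeatTrigger {

  /** Hypothetical hook for sending an immediate heartbeat. */
  public interface Heartbeat {
    void sendWithAllStorageReports();
  }

  private final Heartbeat heartbeat;
  private final long minIntervalMillis;
  private final AtomicLong lastTriggerMillis = new AtomicLong();

  public FullVolumeHeartbeatTrigger(Heartbeat heartbeat, long minIntervalMillis) {
    this.heartbeat = heartbeat;
    this.minIntervalMillis = minIntervalMillis;
  }

  /** Called from the write path when isVolumeFull(...) returns true. */
  public void onFullVolumeDetected() {
    long now = System.currentTimeMillis();
    long last = lastTriggerMillis.get();
    // Global throttle: at most one volume-full heartbeat per interval,
    // even if many write calls detect full volumes concurrently.
    if (now - last >= minIntervalMillis
        && lastTriggerMillis.compareAndSet(last, now)) {
      heartbeat.sendWithAllStorageReports();
    }
  }
}
```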

This is the only remaining open point for discussion - #8460 (comment).

@siddhantsangwan (Contributor, Author):

Updated the doc to reflect the latest discussion, please take a look.

siddhantsangwan marked this pull request as ready for review May 30, 2025 08:44
@siddhantsangwan (Contributor, Author):

@errose28 would you like to take another look?

Member (@peterxcli) left a comment:

LGTM!


## Benefits
1. SCM will not include a Datanode in a new pipeline if all the volumes on it are full. The logic to do this already exists; we just update the volume stats in the SCM faster.
2. Close to full volumes won't cause frequent write failures.
## Preventing over allocation of blocks in the SCM
Member (@peterxcli, Jun 2, 2025):

I guess you forgot to add an h2 "Alternatives" or "Rejected Approaches" here. And "Preventing over allocation of blocks in the SCM" should be h3, just like "Regularly sending open container reports."

Contributor (Author):

Good catch, fixed this in the latest commit.

Comment on lines 39 to 40
because the client will have to request for a different set of blocks. We will discuss how we tried to solve this,
but ultimately decided to not go ahead with the solution.
Contributor:

I'm confused on the status of this doc. What part of it maps to what we actually plan to implement?

Contributor (Author):

The design changed a lot. We've already merged the corresponding implementation - #8590, and I've updated this doc now.

This comment will tell you why and what changed - #8492 (comment)

This comment and the design doc itself will tell you what we implemented - #8590 (comment)

In short, we trigger a heartbeat immediately, including the close container action, with per-container throttling, and without sending storage reports.

@siddhantsangwan (Contributor, Author):

I've updated this design doc to reflect the implementation that's already been merged - #8590

Contributor (@sumitagrawl) left a comment:

@siddhantsangwan Please update the design doc.

@siddhantsangwan (Contributor, Author):

@sumitagrawl thanks for the review, I've addressed your comments.

Contributor (@sumitagrawl) left a comment:

LGTM

sumitagrawl merged commit f727051 into apache:master Jul 7, 2025 (14 checks passed)
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jul 31, 2025