File added: `hadoop-hdds/docs/content/design/full-volume-handling.md` (65 additions)
---
title: Full Volume Handling
summary: Immediately trigger Datanode heartbeat on detecting full volume
date: 2025-05-12
jira: HDDS-12929
status: Design
author: Siddhant Sangwan, Sumit Agrawal
---

<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

## Summary
On detecting a full Datanode volume during write, immediately trigger a heartbeat containing the latest storage report.

## Problem
When a Datanode volume is close to full, the SCM may not be immediately aware because storage reports are only sent
to it every thirty seconds. This can lead to the SCM allocating multiple blocks to containers on a full DN volume,
causing performance issues when the write fails. The proposal will partly solve this problem.
**Contributor:**

SCM checks the container size before allocating a block for a container. Currently the container size is only reported in a full container report, or when the container state changes from open to another state, so SCM is essentially allocating blocks blindly in between. In these 30s storage reports, I think we should consider reporting the open containers too, to help SCM better understand open container state and avoid over-allocating blocks to one container. @siddhantsangwan, what do you think?

**Contributor Author:**

So you mean include a full container report for all containers in the DN, not just the ones on the full volume? We can use the method `StateContext#getFullContainerReportDiscardPendingICR`.

**@ChenSammi (Contributor, May 19, 2025):**

An open containers report, not a full container report. Only open containers grow in size; other containers' sizes only shrink, if they change at all. A timely open container size update will help SCM allocate blocks more precisely.

**Contributor:**

@siddhantsangwan
When a container is 90% full, we add an ICR with a closeContainer action to be sent in the next HB. With the mechanism that is already present, this can be sent immediately instead of waiting for the next HB. I think the ICR would already have been added when sending this; it can be verified.

@ChenSammi Since the DN already decides to stop block allocation when a container is 90% full and the action is sent, SCM receiving the open container list may not provide any further benefit, as the action has already been taken by the DN.

We need to see whether sending OpenContainer information provides any additional benefit. Sending open containers on every HB may be tracked in a separate JIRA and, based on the benefits, implemented.


In the future (https://issues.apache.org/jira/browse/HDDS-12151) we plan to fail a write if it's going to exceed the min free space boundary in a volume. To prevent this from happening often, SCM needs to stop allocating blocks to containers on such volumes in the first place.

## Non Goals
The proposed solution describes the complete solution at a high level; however, HDDS-12929 will only add the initial Datanode-side code for triggering a heartbeat on detecting a full volume, plus the throttling logic.

Failing the write if it exceeds the min free space boundary is not discussed here.

## Proposed Solution

### What does the Datanode do currently?

In HddsDispatcher, on detecting that the volume being written to is close to full, we add a CloseContainerAction for
that container. This is sent to the SCM in the next heartbeat and makes the SCM close that container. This reaction time
is OK for a container that is close to full, but not if the volume is close to full.
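The current flow can be sketched roughly as follows. This is a simplified model with hypothetical names (the real logic lives in `HddsDispatcher` and the heartbeat state machine); the key point is that the action is only queued, so SCM learns of it on the next periodic heartbeat.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified sketch of the existing behavior: when a write lands on a
// container whose volume is close to full, the dispatcher queues a
// CloseContainerAction. It is only shipped with the NEXT heartbeat,
// which may be up to 30 seconds away.
public class CloseActionSketch {
  record CloseContainerAction(long containerId) {}

  private final Queue<CloseContainerAction> pendingActions = new ArrayDeque<>();

  // Called from the write path (analogue of the HddsDispatcher check).
  public void onWrite(long containerId, boolean volumeNearlyFull) {
    if (volumeNearlyFull) {
      // Queued, not sent: SCM only finds out on the next heartbeat.
      pendingActions.add(new CloseContainerAction(containerId));
    }
  }

  // Called by the heartbeat task every ~30 seconds; returns the number
  // of actions drained into the outgoing heartbeat.
  public int drainIntoHeartbeat() {
    int sent = pendingActions.size();
    pendingActions.clear();
    return sent;
  }
}
```

This delay is acceptable when one container is nearly full, but when the whole volume is nearly full, SCM can allocate many more blocks in that 30-second window.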
**Contributor:**

How does this account for the 5GB of reserved/committed space for open containers? I believe that will be counted against the volume's capacity as well. We need to break down what measures are being used to consider a "full" volume.

**Contributor:**

We also have the ability to immediately trigger a heartbeat at any time. We do this for volume failure already. Seems the issue could be resolved by leaving the CloseContainerAction handling as is and just calling this method when the volume is getting full. This leaves the volume to container mapping inside the datanode without needing to add it to SCM.

**Contributor Author:**

> How does this account for the 5GB of reserved/committed space for open containers? I believe that will be counted against the volume's capacity as well. We need to break down what measures are being used to consider a "full" volume.

Yes, and we will use the existing method in the current code to determine whether a volume is full. It accounts for committed space, min free space and reserved space.

```java
private boolean isVolumeFull(Container container) {
  boolean isOpen = Optional.ofNullable(container)
      .map(cont -> cont.getContainerState() == ContainerDataProto.State.OPEN)
      .orElse(Boolean.FALSE);
  if (isOpen) {
    HddsVolume volume = container.getContainerData().getVolume();
    StorageLocationReport volumeReport = volume.getReport();
    boolean full = volumeReport.getUsableSpace() <= 0;
    if (full) {
      LOG.info("Container {} volume is full: {}",
          container.getContainerData().getContainerID(), volumeReport);
    }
    return full;
  }
  return false;
}
```

**Contributor Author:**

> We also have the ability to immediately trigger a heartbeat at any time. We do this for volume failure already. Seems the issue could be resolved by leaving the CloseContainerAction handling as is and just calling this method when the volume is getting full.

Yes, that's what I'm planning to do. However, before sending the heartbeat we need to generate the latest storage report and add it to the heartbeat. Also, the CloseContainerAction currently covers only that one container. We either need to add an action for all containers on that volume, or send a list of all container IDs in that volume in the heartbeat.

Actually, one idea I've got from this conversation is that adding an action for each container might be easier, as we already have a framework in place in the SCM for handling these actions. Let me think a bit more about this. An immediate hurdle is that the number of actions that can be sent is also throttled as of now.


### Proposal
This is the proposal, explained via a diagram.
**Contributor:**

I don't think the diagram adequately explains the proposal. When talking about proposed proto updates, writing out code examples is helpful. The throttling implementation needs to be specified.

**Contributor Author:**

I had thought of the throttling implementation and even tried it out in code while thinking of the design, but I didn't specify it in the doc. I've added it to the design now. There's also a pull request which implements a part of this design, including throttling - #8492.

Please bear with me as I try to balance too much vs. too little content in my designs!

> When talking about proposed proto updates writing out code examples is helpful.

Still thinking about what information we can send over the wire; I'll add it once I've decided. There are a couple of high-level ideas to prevent over-allocating blocks to a container/volume in the SCM:

  1. Track how much allocation is done at the SCM, similar to used space and committed space at the DN. Proposed by @sumitagrawl.
  2. Send reports of open containers every 30 seconds to SCM, which @ChenSammi proposed. Also handle these reports so that any containers with size >= max size are closed.
  3. Decide which containers should be closed in the DN, and send CloseContainerAction for all these containers in the heartbeat (which we briefly discussed in a previous comment).

We may need to do some of these or a mix of all of these. As I think it through I'll add more info.


![full-volume-handling.png](../../static/full-volume-handling.png)

Throttling is required so the Datanode doesn't cause a heartbeat storm on detecting that some volumes are full in multiple write calls.
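One simple way to implement this throttling (a sketch only, with hypothetical names; the actual implementation is in the linked PR) is a compare-and-set cooldown gate, so that concurrent write threads observing full volumes collapse into at most one out-of-band heartbeat per window:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of heartbeat throttling: no matter how many write calls observe
// a full volume, at most one out-of-band heartbeat is triggered per
// cooldown window. Thread-safe via compare-and-set on the last trigger time.
public class FullVolumeHeartbeatThrottle {
  private final long cooldownMillis;
  // -1 means "never triggered yet".
  private final AtomicLong lastTriggerMillis = new AtomicLong(-1);

  public FullVolumeHeartbeatThrottle(long cooldownMillis) {
    this.cooldownMillis = cooldownMillis;
  }

  // Returns true only for the caller that wins the CAS outside the
  // cooldown window; all other callers are suppressed.
  public boolean tryTrigger(long nowMillis) {
    long last = lastTriggerMillis.get();
    if (last >= 0 && nowMillis - last < cooldownMillis) {
      return false; // still inside the cooldown window
    }
    return lastTriggerMillis.compareAndSet(last, nowMillis);
  }
}
```

A caller on the write path would invoke `tryTrigger(System.currentTimeMillis())` on detecting a full volume and only build and send the out-of-band heartbeat when it returns true.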
**@ChenSammi (Contributor, May 19, 2025):**

These are the activities that use DN volume space:
a. creating a new container (reserves 5GB)
b. writing a new chunk
c. downloading and importing a container (reserves 10GB)
d. container metadata RocksDB (no reservation)

IIRC, when SCM allocates a new pipeline, it checks whether the DN has enough space to hold the pipeline meta info (Raft, 1GB) and one container (5GB). A volume full report can help SCM become aware of this quickly. Maybe a full storage report, instead of a single-volume full report.

As for the proposal to carry the list of containers on the volume in the disk full report: because an open container has already reserved space in the volume (same for a container replication import), even though the disk volume is full, these open containers may still have room for new blocks, as long as the total container size doesn't exceed 5GB. So immediately closing all open containers on the full volume might not be a necessary step. But closing open containers whose size is beyond 5GB is one thing we can do. And when a disk is full, the DN is responsible for not allocating new containers on this volume and not picking it as the target volume for container imports.

So overall my suggestion is:
a. carry open container state in the periodic storage report
b. when a disk is full, immediately send a full storage report with open container state to SCM, out of cycle
c. make sure these kinds of reports are handled with priority in SCM. We may consider introducing a new port in SCM just for DN heartbeats with storage reports; currently all reports are sent to one single port.

**Contributor Author:**

> IIRC, when SCM allocates a new pipeline, SCM checks whether DN has enough space to hold pipeline metainfo (raft, 1GB), and one container (5GB). A volume full report can help SCM quickly aware of this. Maybe a full storage report, instead of single volume full report.

Agreed, I'm planning to send a full storage report containing info about all the volumes.

> As for carrying the list of containers on this volume in disk full report proposal, because open container has already reserved space in volume, same for container replication import, although disk volume is full, these open containers may still have room for new blocks, as long as the total container size doesn't exceed 5GB. So closing all open containers of this disk full volume immediately might not be a necessary step. But closing open containers whose size is beyond 5GB is one thing we can do.

Good point. I'm planning to use `HddsDispatcher#isVolumeFull` to check whether a volume is full. This method ultimately checks whether `available - committed - min free space <= 0`. So if it returns true, we only need to close containers whose size >= max size (5 GB); all containers on the volume need not be closed.
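That check reduces to simple arithmetic. The following is a minimal sketch with a hypothetical helper name; in the real code the values come from `StorageLocationReport` and the check is `HddsDispatcher#isVolumeFull`:

```java
// Sketch of the "volume full" arithmetic: a volume is full once the
// space actually usable for new writes is exhausted:
//   usable = available - committed - minFreeSpace
public final class VolumeSpaceCheck {
  private VolumeSpaceCheck() {}

  public static boolean isFull(long availableBytes, long committedBytes,
      long minFreeSpaceBytes) {
    long usable = availableBytes - committedBytes - minFreeSpaceBytes;
    return usable <= 0;
  }
}
```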


## Benefits
1. SCM will not include a Datanode in a new pipeline if all the volumes on it are full. The logic to do this already exists; we just update the volume stats in the SCM faster.
**Contributor:**

Does the SCM logic to allocate pipelines take the min free space setting into account? Say a volume reached 95GB/100GB disk space, triggered a close for that container, and sent a volume report, and SCM updated the volume stats for the node. Does SCM ensure that it does not allocate containers at 95GB usage, or does it wait till 100GB? I believe SCM should be aware of, or have, a similar or more liberal config.

**Contributor Author:**

Yes, it takes min free space into account.

2. Close to full volumes won't cause frequent write failures.

## Alternatives
Instead of including the list of containers present on the full volume in the Storage Report, we could add the volume ID to the Container Replica proto. In the SCM, this would mean a linear scan through all the Container Replica objects in the system to figure out which containers are present on the full volume, which is slow. Alternatively, we could build and maintain a map for this lookup, which is more complex than the proposed solution.
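For comparison, the rejected map-based alternative would look roughly like this (a hypothetical sketch, not existing SCM code): SCM would keep a volume-to-containers index and pay the cost of keeping it consistent on every replica update.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the rejected alternative: SCM keeps a
// volumeId -> containerIds index so a full-volume report can be resolved
// without a linear scan over all ContainerReplica objects. The price is
// that every replica add/remove must keep this map consistent.
public class VolumeContainerIndex {
  private final Map<String, Set<Long>> containersByVolume = new HashMap<>();

  public void onReplicaAdded(String volumeId, long containerId) {
    containersByVolume.computeIfAbsent(volumeId, v -> new HashSet<>())
        .add(containerId);
  }

  public void onReplicaRemoved(String volumeId, long containerId) {
    Set<Long> ids = containersByVolume.get(volumeId);
    if (ids != null) {
      ids.remove(containerId);
      if (ids.isEmpty()) {
        containersByVolume.remove(volumeId); // drop empty volume entries
      }
    }
  }

  // Containers currently known to live on the given volume.
  public Set<Long> containersOn(String volumeId) {
    return containersByVolume.getOrDefault(volumeId, Set.of());
  }
}
```

The proposed solution avoids this bookkeeping entirely by keeping the volume-to-container mapping inside the Datanode.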

## Implementation Plan
1. HDDS-13045: Initial code for including node report, triggering heartbeat, throttling.
2. HDDS-12151: Fail a write call if it exceeds min free space boundary
3. Future Jira: Handle full volume report on the SCM side - close containers.
4. HDDS-12658: Try not to select full pipelines when allocating a block in SCM.
Binary file added: `hadoop-hdds/docs/static/full-volume-handling.png`