---
title: Full Volume Handling
summary: Immediately trigger Datanode heartbeat on detecting full volume
date: 2025-05-12
jira: HDDS-12929
status: Design
author: Siddhant Sangwan, Sumit Agrawal
---

<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

## Summary
On detecting a full Datanode volume during write, immediately trigger a heartbeat containing the latest storage
report for all volumes.

## Problem
When a Datanode volume is close to full, the SCM may not be immediately aware of it, because storage reports are only
sent to it every minute (`HDDS_NODE_REPORT_INTERVAL_DEFAULT = "60s"`). We would like the SCM to know about this as
soon as possible, so it can make an informed decision when checking the volumes in that Datanode to decide whether a
new pipeline can include that Datanode (in `SCMCommonPlacementPolicy.hasEnoughSpace`).

Additionally, the SCM only has stale information about the current size of a container, because container size is only
updated when an Incremental Container Report (event based, for example when a container transitions from the open to
the closing state) or a Full Container Report (`HDDS_CONTAINER_REPORT_INTERVAL_DEFAULT = "60m"`) is received. This can
lead to the SCM over-allocating blocks to containers on a full DN volume. When the writes eventually fail, performance
will drop because the client will have to request a different set of blocks. We will discuss how we tried to solve
this, but ultimately decided not to go ahead with the solution.

### The definition of a full volume
A volume is considered full if the following (existing) method returns true.
```java
private boolean isVolumeFull(Container container) {
  boolean isOpen = Optional.ofNullable(container)
      .map(cont -> cont.getContainerState() == ContainerDataProto.State.OPEN)
      .orElse(Boolean.FALSE);
  if (isOpen) {
    HddsVolume volume = container.getContainerData().getVolume();
    StorageLocationReport volumeReport = volume.getReport();
    boolean full = volumeReport.getUsableSpace() <= 0;
    if (full) {
      LOG.info("Container {} volume is full: {}",
          container.getContainerData().getContainerID(), volumeReport);
    }
    return full;
  }
  return false;
}
```

It accounts for available space, committed space, min free space and reserved space:
```java
private static long getUsableSpace(
    long available, long committed, long minFreeSpace) {
  return available - committed - minFreeSpace;
}
```
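As a worked example, the usable-space arithmetic can be checked with illustrative numbers. The 12 GB / 8 GB / 5 GB figures below are made up for the example and are not Ozone defaults:

```java
// Illustrative numbers only; mirrors the getUsableSpace() logic above.
public class UsableSpaceExample {
  static long getUsableSpace(long available, long committed, long minFreeSpace) {
    return available - committed - minFreeSpace;
  }

  public static void main(String[] args) {
    long gb = 1024L * 1024 * 1024;
    // 12 GB available, 8 GB committed to open containers, 5 GB min free:
    // usable space is -1 GB, so the volume counts as full even though the
    // disk still has free bytes.
    long usable = getUsableSpace(12 * gb, 8 * gb, 5 * gb);
    System.out.println(usable <= 0); // prints "true"
  }
}
```

Note that committed space counts against usable space: a volume can be "full" for placement purposes while physical free space remains.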

In the future (https://issues.apache.org/jira/browse/HDDS-12151) we plan to fail a write if it's going to exceed the
min free space boundary in a volume.

## Non Goals
This document describes the complete solution. HDDS-13045 will add the Datanode-side code for triggering a heartbeat
on detecting a full volume, plus the throttling logic.

Failing the write if it exceeds the min free space boundary (https://issues.apache.org/jira/browse/HDDS-12151) is not
discussed here.

## Proposed Solution

### What does the Datanode do currently when a volume is full?

In `HddsDispatcher`, on detecting that the volume being written to is full (as defined previously), we add a
`CloseContainerAction` for that container:

```java
private void sendCloseContainerActionIfNeeded(Container container) {
  // We have to find a more efficient way to close a container.
  boolean isSpaceFull = isContainerFull(container) || isVolumeFull(container);
  boolean shouldClose = isSpaceFull || isContainerUnhealthy(container);
  if (shouldClose) {
    ContainerData containerData = container.getContainerData();
    ContainerAction.Reason reason =
        isSpaceFull ? ContainerAction.Reason.CONTAINER_FULL :
            ContainerAction.Reason.CONTAINER_UNHEALTHY;
    ContainerAction action = ContainerAction.newBuilder()
        .setContainerID(containerData.getContainerID())
        .setAction(ContainerAction.Action.CLOSE).setReason(reason).build();
    context.addContainerActionIfAbsent(action);
  }
}
```
This is sent to the SCM in the next heartbeat and makes the SCM close that
container. This reaction time is OK for a container that is close to full, but not if the volume is close to full.

### Proposal for immediately triggering Datanode heartbeat
This is the proposal, explained via a diagram.
![full_volume_handling.png](../../static/full_volume_handling.png)

On detecting that a volume is full, the Datanode will get the latest storage reports for all volumes present on the
node. It will add these to the heartbeat and immediately trigger it. If the container is also full, the
CloseContainerAction will be sent in the same heartbeat.
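The flow above can be sketched as follows. All class and method names here are invented for illustration; they are not the actual Ozone APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the flow in the diagram; names are invented for this example.
public class FullVolumeTrigger {

  /** Stand-in for the heartbeat payload. */
  public static final class Heartbeat {
    public final List<String> storageReports = new ArrayList<>();
    public boolean closeContainerAction;
  }

  /**
   * On detecting a full volume: collect fresh storage reports for ALL
   * volumes (not just the full one), add a CloseContainerAction if the
   * container is also full, and fire the heartbeat out of cycle.
   */
  public static Heartbeat onVolumeFull(List<String> allVolumeReports,
      boolean containerAlsoFull, Consumer<Heartbeat> triggerNow) {
    Heartbeat hb = new Heartbeat();
    hb.storageReports.addAll(allVolumeReports);
    hb.closeContainerAction = containerAlsoFull;
    triggerNow.accept(hb);
    return hb;
  }
}
```

Reporting every volume in the same out-of-cycle heartbeat means the SCM gets a complete picture of the node in a single message.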

#### Throttling
Throttling is required so that the Datanode doesn't cause a heartbeat storm when multiple write calls detect full volumes around the same time.
The Datanode can throttle by ensuring that only one unplanned heartbeat is sent every heartbeat interval or 30 seconds,
whichever is lower. Throttling should be enforced across multiple threads and different volumes.

Here's a visualisation to explain this. The letters (A, B, C etc.) denote events and timestamp is the time at which
an event occurs.
```
Write Call 1:
/ A, timestamp: 0/-------------/B, timestamp: 5/
Write Call 2, in-parallel with 1:
------------------------------ /C, timestamp: 5/
Write Call 3, in-parallel with 1 and 2:
---------------------------------------/D, timestamp: 7/
Write Call 4:
------------------------------------------------------------------------/E, timestamp: 35/
Events:
A: Last, regular heartbeat
B: Volume 1 detected as full, heartbeat triggered
C: Volume 1 again detected as full, heartbeat throttled
D: Volume 2 detected as full, heartbeat throttled
E: Volume 3 detected as full, heartbeat triggered (30 seconds after B)
```
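The throttling rule and the timeline above can be sketched as a small thread-safe helper. The class and method names are invented for this sketch and are not the actual Ozone classes:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the throttling rule: at most one unplanned heartbeat per
// min(heartbeat interval, 30s), shared across all threads and volumes.
public class HeartbeatThrottle {
  private final long minIntervalMillis;
  private final AtomicLong lastTriggerMillis;

  public HeartbeatThrottle(long heartbeatIntervalMillis) {
    // One unplanned heartbeat per heartbeat interval or 30s, whichever is lower.
    this.minIntervalMillis = Math.min(heartbeatIntervalMillis, 30_000L);
    // Allow the first unplanned heartbeat immediately.
    this.lastTriggerMillis = new AtomicLong(-minIntervalMillis);
  }

  /** Returns true iff the caller won the race and may trigger a heartbeat. */
  public boolean tryAcquire(long nowMillis) {
    long last = lastTriggerMillis.get();
    return nowMillis - last >= minIntervalMillis
        && lastTriggerMillis.compareAndSet(last, nowMillis);
  }
}
```

Replaying the timeline above: B at t=5s is allowed, C and D within 30 seconds of B are throttled, and E at t=35s is allowed again. The compare-and-set makes the check safe when concurrent write calls race on different volumes.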
For code implementation, see https://github.com/apache/ozone/pull/8492.

## Rejected Approaches

### Preventing over-allocation of blocks in the SCM

Trying to prevent over-allocation of blocks to a container is complicated. We could track how much space we've
allocated to a container in the SCM. This looks doable on the surface, but it won't actually work well, because the
SCM is asked for a block (256 MB) without knowing how much data the client will actually write to that block file;
the client may only write 1 MB, for example. So the SCM could track that it has already allocated 5 GB to a container
and open another container for incoming requests, while the client may actually write only 1 GB in total. With 10k
requests/second, this would lead to a lot of open containers.
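To illustrate with made-up numbers: if the SCM tracked allocations at block granularity, twenty 256 MB block allocations would count as a full 5 GB container even if each client wrote only 1 MB:

```java
// Illustrative arithmetic only; the block and container sizes below are the
// defaults mentioned in this doc, the write sizes are invented.
public class OverAllocationExample {
  public static void main(String[] args) {
    long blockSize = 256L * 1024 * 1024;    // SCM hands out 256 MB blocks
    int blocksAllocated = 20;
    long trackedBytes = blocksAllocated * blockSize;       // 5 GB tracked
    long writtenPerBlock = 1L * 1024 * 1024;               // client writes 1 MB
    long actualBytes = blocksAllocated * writtenPerBlock;  // 20 MB on disk
    // The SCM would consider the container full and open a new one, while
    // the container actually holds a tiny fraction of its max size.
    System.out.println(trackedBytes / actualBytes); // prints 256
  }
}
```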

At this point, we've decided not to do anything about this.

### Regularly sending open container reports
Sending open container reports regularly (every 30 seconds for example) can help a little bit, but won't solve the
problem. We won't take this approach for now.

## Benefits
SCM will not include a Datanode in a new pipeline if all the volumes on it are full. The logic to do this already
exists; we just update the volume stats in the SCM faster.

## Implementation Plan
1. HDDS-13045: Code for including node report, triggering heartbeat, throttling.
2. HDDS-12151: Fail a write call if it exceeds min free space boundary (not discussed in this doc).