90 changes: 68 additions & 22 deletions hadoop-hdds/docs/content/design/full-volume-handling.md
@@ -22,12 +22,22 @@ author: Siddhant Sangwan, Sumit Agrawal
-->

## Summary
On detecting a full Datanode volume during write, immediately trigger a heartbeat containing the latest storage report.
On detecting a full Datanode volume during write, immediately trigger a heartbeat containing the latest storage
report for all volumes.

## Problem
When a Datanode volume is close to full, the SCM may not be immediately aware because storage reports are only sent
to it every thirty seconds. This can lead to the SCM allocating multiple blocks to containers on a full DN volume,
causing performance issues when the write fails. The proposal will partly solve this problem.
When a Datanode volume is close to full, the SCM may not be immediately aware of this because storage reports are only
sent to it once every minute (`HDDS_NODE_REPORT_INTERVAL_DEFAULT = "60s"`). We would like the SCM to know about this as
soon as possible, so that it can make an informed decision when checking the volumes on that Datanode to decide whether a
new pipeline can include that Datanode (in `SCMCommonPlacementPolicy.hasEnoughSpace`).
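
For context, the placement-time check has roughly the following shape. This is a simplified, hypothetical sketch, not the actual signature of `SCMCommonPlacementPolicy.hasEnoughSpace`; the 1 GB Raft-metadata and 5 GB container reservations are the figures mentioned in the review discussion further down.

```java
import java.util.List;

// Hypothetical, simplified sketch of the placement-time space check; the real
// method reads these sizes from the Datanode's storage reports held in SCM.
final class PlacementSpaceCheckSketch {
  private static final long ONE_GB = 1024L * 1024 * 1024;
  private static final long CONTAINER_SIZE = 5 * ONE_GB;      // room for one new container
  private static final long RAFT_METADATA_SIZE = 1 * ONE_GB;  // room for pipeline (Raft) metadata

  /** A Datanode qualifies only if some data volume can hold a new container
   *  and some metadata volume can hold the Raft log for the pipeline. */
  static boolean hasEnoughSpace(List<Long> dataVolumeFreeBytes,
                                List<Long> metadataVolumeFreeBytes) {
    boolean dataOk = dataVolumeFreeBytes.stream().anyMatch(free -> free >= CONTAINER_SIZE);
    boolean metaOk = metadataVolumeFreeBytes.stream().anyMatch(free -> free >= RAFT_METADATA_SIZE);
    return dataOk && metaOk;
  }
}
```

If the SCM only refreshes these free-space figures once a minute, the check can keep passing for a node whose volumes have already filled up.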

Additionally, SCM only has stale
information about the current size of a container, because container size is only updated when an Incremental Container
Report (event based, for example when a container transitions from the open to the closing state) or a Full
Container Report (`HDDS_CONTAINER_REPORT_INTERVAL_DEFAULT = "60m"`) is received. This can lead to the SCM
over-allocating blocks to containers on a full DN volume. When the writes eventually fail, performance will drop
because the client will have to request a different set of blocks. We will discuss how we tried to solve this,
but ultimately decided not to go ahead with the solution.
Contributor:

I'm confused on the status of this doc. What part of it maps to what we actually plan to implement?

Contributor Author:

The design changed a lot. We've already merged the corresponding implementation - #8590, and I've updated this doc now.

This comment will tell you why and what changed - #8492 (comment)

This comment and the design doc itself will tell you what we implemented - #8590 (comment)

In short, we trigger the heartbeat immediately, including the close container action, with per-container throttling. We are not sending storage reports.


### The definition of a full volume
A volume is considered full if the following (existing) method returns true.
@@ -57,25 +67,52 @@ It accounts for available space, committed space, min free space and reserved space.
}
```
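
The body of the method is collapsed in this diff, but as the review discussion below notes, the condition it evaluates boils down to the following minimal sketch (assuming reserved space has already been excluded from the available figure; this is not the exact production code):

```java
// Minimal sketch of the full-volume condition; not the exact production code.
static boolean isVolumeFull(long availableBytes, long committedBytes, long minFreeSpaceBytes) {
  // The volume is treated as full once the space that is still safe to write into
  // (available minus what is already committed to open containers) drops to or
  // below the configured minimum free space.
  return availableBytes - committedBytes - minFreeSpaceBytes <= 0;
}
```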

In the future (https://issues.apache.org/jira/browse/HDDS-12151) we plan to fail a write if it's going to exceed the min free space boundary in a volume. To prevent this from happening often, SCM needs to stop allocating blocks to containers on such volumes in the first place.
In the future (https://issues.apache.org/jira/browse/HDDS-12151) we plan to fail a write if it's going to exceed the
min free space boundary in a volume.

## Non Goals
The proposed solution describes the complete solution at a high level; however, HDDS-12929 will only add the initial Datanode-side code for triggering a heartbeat on detecting a full volume, plus throttling logic.
The proposed solution describes the complete solution. HDDS-13045 will add the Datanode-side code
for triggering a heartbeat on detecting a full volume, plus the throttling logic.

Failing the write if it exceeds the min free space boundary is not discussed here.
Failing the write if it exceeds the min free space boundary (https://issues.apache.org/jira/browse/HDDS-12151) is not
discussed here.

## Proposed Solution

### What does the Datanode do currently?
### What does the Datanode do currently when a volume is full?

In HddsDispatcher, on detecting that the volume being written to is close to full, we add a CloseContainerAction for
that container. This is sent to the SCM in the next heartbeat and makes the SCM close that container. This reaction time
is OK for a container that is close to full, but not if the volume is close to full.
In `HddsDispatcher`, on detecting that the volume being written to is full (as defined previously), we add a
`CloseContainerAction` for that container:

### Proposal
```java
  private void sendCloseContainerActionIfNeeded(Container container) {
    // We have to find a more efficient way to close a container.
    boolean isSpaceFull = isContainerFull(container) || isVolumeFull(container);
    boolean shouldClose = isSpaceFull || isContainerUnhealthy(container);
    if (shouldClose) {
      ContainerData containerData = container.getContainerData();
      ContainerAction.Reason reason =
          isSpaceFull ? ContainerAction.Reason.CONTAINER_FULL :
              ContainerAction.Reason.CONTAINER_UNHEALTHY;
      ContainerAction action = ContainerAction.newBuilder()
          .setContainerID(containerData.getContainerID())
          .setAction(ContainerAction.Action.CLOSE).setReason(reason).build();
      context.addContainerActionIfAbsent(action);
    }
  }
```
This is sent to the SCM in the next heartbeat and makes the SCM close that
container. This reaction time is OK for a container that is close to full, but not if the volume is close to full.

### Proposal for immediately triggering Datanode heartbeat
This is the proposal, explained via a diagram.
Contributor:

I don't think the diagram adequately explains the proposal. When talking about proposed proto updates, writing out code examples is helpful. The throttling implementation needs to be specified.

Contributor Author:

I had thought of the throttling implementation and even tried it out in code while thinking of the design, but I didn't specify it in the doc. I've added it to the design now. There's also a pull request which implements a part of this design, including throttling - #8492.

Please bear with me as I try to balance too much vs. too little content in my designs!

When talking about proposed proto updates, writing out code examples is helpful.

Still thinking about what information we can send over the wire. I'll add it once I've decided. There are a couple of high-level ideas to prevent over-allocating blocks to a container/volume in the SCM:

  1. Track how much allocation is done at the SCM, similar to used space and committed space at the DN. Proposed by @sumitagrawl.
  2. Send reports of open containers every 30 seconds to SCM, which @ChenSammi proposed. Also handle these reports so that any containers with size >= max size are closed.
  3. Decide which containers should be closed in the DN, and send CloseContainerAction for all these containers in the heartbeat (which we briefly discussed in a previous comment).

We may need to do some of these or a mix of all of these. As I think it through I'll add more info.


![full-volume-handling.png](../../static/full-volume-handling.png)
![full_volume_handling.png](../../static/full_volume_handling.png)

On detecting that a volume is full, the Datanode will get the latest storage reports for all volumes present on the
node. It will add these to the heartbeat and immediately trigger it. If the container is also full, the
CloseContainerAction will be sent in the same heartbeat.
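
A minimal sketch of that flow, as it might sit in `HddsDispatcher` next to the existing check shown above. The helpers `refreshVolumeReports` and `triggerHeartbeat` are placeholders for whatever the real `StateContext` / state-machine API exposes, not actual method names; see the linked pull requests for the merged implementation.

```java
// Hypothetical sketch; only addContainerActionIfAbsent (via
// sendCloseContainerActionIfNeeded) is taken from the existing code above.
private void handleFullVolumeIfNeeded(Container container) {
  if (!isVolumeFull(container)) {
    return;
  }
  // Existing behaviour: queue a CloseContainerAction if the container (or its
  // volume) is full, so SCM closes the container.
  sendCloseContainerActionIfNeeded(container);

  // New behaviour: refresh the storage report for all volumes on this node so
  // SCM sees current usage instead of the up-to-one-minute-old snapshot...
  context.refreshVolumeReports();          // placeholder

  // ...and fire the heartbeat immediately instead of waiting for the next
  // scheduled interval; the report and any pending container actions ride on
  // this same heartbeat.
  context.getParent().triggerHeartbeat();  // placeholder
}
```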

#### Throttling
Throttling is required so that the Datanode doesn't cause a heartbeat storm when it detects full volumes across multiple write calls.
@ChenSammi (Contributor), May 19, 2025:

These are the activities that use DN volume space:
a. create a new container, which reserves 5 GB
b. write a new chunk
c. download and import a container, which reserves 10 GB
d. container metadata RocksDB, no reservation

IIRC, when SCM allocates a new pipeline, SCM checks whether the DN has enough space to hold the pipeline metainfo (Raft, 1 GB) and one container (5 GB). A volume full report can help SCM quickly become aware of this. Maybe a full storage report, instead of a single volume full report.

As for the proposal of carrying the list of containers on this volume in the disk full report: because an open container has already reserved space in the volume (same for container replication import), these open containers may still have room for new blocks even though the volume is full, as long as the total container size doesn't exceed 5 GB. So closing all open containers on this full volume immediately might not be a necessary step. But closing open containers whose size is beyond 5 GB is one thing we can do.
And when a disk is full, the DN is responsible for not allocating new containers on this volume and not picking it as the target volume for container import.

So overall my suggestion is:
a. carry open container state in the periodic storage report
b. when one disk is full, send a full storage report immediately with the open container state to SCM, out of cycle.
c. make sure these kinds of reports get handled with priority in SCM. We may consider introducing a new port in SCM just for DN heartbeats with storage reports. Currently all reports are sent to one single port.

Contributor Author:

IIRC, when SCM allocates a new pipeline, SCM checks whether the DN has enough space to hold the pipeline metainfo (Raft, 1 GB) and one container (5 GB). A volume full report can help SCM quickly become aware of this. Maybe a full storage report, instead of a single volume full report.

Agreed, I'm planning to send a full storage report containing info about all the volumes.

As for the proposal of carrying the list of containers on this volume in the disk full report: because an open container has already reserved space in the volume (same for container replication import), these open containers may still have room for new blocks even though the volume is full, as long as the total container size doesn't exceed 5 GB. So closing all open containers on this full volume immediately might not be a necessary step. But closing open containers whose size is beyond 5 GB is one thing we can do.

Good point. I'm planning to use HddsDispatcher#isVolumeFull to check whether a volume is full. This method ultimately checks whether available - committed - min free space <= 0. So if this method returns true, we only need to close containers whose size >= max size (5 GB); not all containers on this volume need to be closed.

@@ -106,15 +143,24 @@ E: Volume 3 detected as full, heartbeat triggered (30 seconds after B)
```
For code implementation, see https://github.com/apache/ozone/pull/8492.
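
As a rough illustration of the throttling idea (independent of the exact mechanism in that PR): keep a timestamp per volume and only allow a volume-full trigger when the previous one is old enough. The 30-second interval mirrors the timeline above; the field and method names are illustrative, and the merged implementation throttles per container rather than per volume, as noted in the comments.

```java
// Illustrative time-based throttle (names are placeholders): allow at most one
// full-volume-triggered heartbeat per volume within the minimum interval.
private final java.util.concurrent.ConcurrentHashMap<String, Long> lastTriggerMillis =
    new java.util.concurrent.ConcurrentHashMap<>();
private static final long MIN_TRIGGER_INTERVAL_MILLIS = 30_000L; // matches the 30s in the timeline

private boolean shouldTriggerHeartbeat(String volumeId) {
  final long now = System.currentTimeMillis();
  final boolean[] allowed = {false};
  // compute() keeps the check-and-update atomic across concurrent write handlers.
  lastTriggerMillis.compute(volumeId, (id, last) -> {
    if (last == null || now - last >= MIN_TRIGGER_INTERVAL_MILLIS) {
      allowed[0] = true;
      return now;   // record this trigger
    }
    return last;    // throttled: keep the previous timestamp
  });
  return allowed[0];
}
```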

## Benefits
1. SCM will not include a Datanode in a new pipeline if all the volumes on it are full. The logic to do this already exists; we just update the volume stats in the SCM faster.
2. Close to full volumes won't cause frequent write failures.
## Preventing over allocation of blocks in the SCM
@peterxcli (Member), Jun 2, 2025:

I guess you forgot to add an h2 "Alternatives" or "Rejected Approaches" here. And "Preventing over allocation of blocks in the SCM" should be h3, just like "Regularly sending open container reports."

Contributor Author:

Good catch, fixed this in the latest commit.

Trying to prevent over-allocation of blocks to a container is complicated. We could track how much space we've
allocated to a container in the SCM - this is doable on the surface, but it won't actually work well. That's because SCM
is asked for a block (256 MB), but SCM doesn't know how much data a client will actually write to that block file. The
client may only write 1 MB, for example. So SCM could track that it has already allocated 5 GB to a container and
open another container for incoming requests, while the client may actually only write 1 GB. This would lead to a lot of
open containers when we have 10k requests/second.
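
To make the mismatch concrete (an illustrative worst case, not a measured figure): a 5 GB container admits 5120 MB / 256 MB = 20 block allocations, so allocation-based tracking would declare it full after 20 blocks. If every client wrote only 1 MB per block, the container would really hold about 20 MB, and at 10k block requests per second the SCM would be opening on the order of 10000 / 20 = 500 new containers per second while the disks stayed nearly empty.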

At this point, we've decided not to do anything about this.

## Alternatives
Instead of including the list of containers present on the full volume in the Storage Report, we could add the volume ID to the Container Replica proto. In the SCM, this would mean doing a linear scan through all the Container Replica objects present in the system to figure out which containers are present on the full volume, which is slow. Alternatively, we could build and maintain a map for this, which is more complex than the proposed solution.
### Regularly sending open container reports
Sending open container reports regularly (every 30 seconds, for example) can help a little, but it won't solve the
problem. We won't take this approach for now.

## Benefits
SCM will not include a Datanode in a new pipeline if all the volumes on it are full. The logic to do this already
exists; we just update the volume stats in the SCM faster.

## Implementation Plan
1. HDDS-13045: Initial code for including node report, triggering heartbeat, throttling.
2. HDDS-12151: Fail a write call if it exceeds min free space boundary
3. Future Jira: Handle full volume report on the SCM side - close containers.
4. HDDS-12658: Try not to select full pipelines when allocating a block in SCM.
1. HDDS-13045: Code for including node report, triggering heartbeat, throttling.
2. HDDS-12151: Fail a write call if it exceeds min free space boundary (not discussed in this doc).
Binary file removed hadoop-hdds/docs/static/full-volume-handling.png
Binary file added hadoop-hdds/docs/static/full_volume_handling.png