
Conversation

@siddhantsangwan (Contributor):

What changes were proposed in this pull request?

This is a design discussion for HDDS-12929. Please see the changes for all the details.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-12929

How was this patch tested?

Rendered the markdown in IntelliJ; it looked fine.

@siddhantsangwan (Contributor, Author):

@peterxcli you may also be interested in reviewing this. For some reason your username doesn't appear in the "Request reviewers" UI.

Comment on lines 43 to 45
In HddsDispatcher, on detecting that the volume being written to is close to full, we add a CloseContainerAction for
that container. This is sent to the SCM in the next heartbeat and makes the SCM close that container. This reaction time
is OK for a container that is close to full, but not if the volume is close to full.
Contributor:

How does this account for the 5GB of reserved/committed space for open containers? I believe that will be counted against the volume's capacity as well. We need to break down what measures are being used to consider a "full" volume.

Contributor:

We also have the ability to immediately trigger a heartbeat at any time. We do this for volume failure already. Seems the issue could be resolved by leaving the CloseContainerAction handling as is and just calling this method when the volume is getting full. This leaves the volume to container mapping inside the datanode without needing to add it to SCM.

Contributor (Author):

> How does this account for the 5GB of reserved/committed space for open containers? I believe that will be counted against the volume's capacity as well. We need to break down what measures are being used to consider a "full" volume.

Yes, and we will use the existing method in the current code to determine whether a volume is full. It accounts for committed space, min free space and reserved space.

  private boolean isVolumeFull(Container container) {
    // Only open containers can still receive writes, so the check only applies to them.
    boolean isOpen = Optional.ofNullable(container)
        .map(cont -> cont.getContainerState() == ContainerDataProto.State.OPEN)
        .orElse(Boolean.FALSE);
    if (isOpen) {
      HddsVolume volume = container.getContainerData().getVolume();
      StorageLocationReport volumeReport = volume.getReport();
      // Usable space already accounts for committed, min free and reserved space,
      // so <= 0 means the volume is effectively full.
      boolean full = volumeReport.getUsableSpace() <= 0;
      if (full) {
        LOG.info("Container {} volume is full: {}", container.getContainerData().getContainerID(), volumeReport);
      }
      return full;
    }
    return false;
  }

Contributor (Author):

> We also have the ability to immediately trigger a heartbeat at any time. We do this for volume failure already. Seems the issue could be resolved by leaving the CloseContainerAction handling as is and just calling this method when the volume is getting full.

Yes, that's what I'm planning to do. However, before sending the heartbeat we need to generate the latest storage report and add it to the heartbeat. Also, the CloseContainerAction is currently only added for that one container. We either need to add an action for all containers on that volume, or send a list of all container IDs on that volume in the heartbeat.

Actually, one idea I've got from this conversation is that adding an action for each container might be easier, as we already have a framework in place in the SCM for handling these actions. Let me think a bit more about this. An immediate hurdle is that the number of actions that can be sent is also throttled as of now.
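
For illustration, a minimal sketch of that per-container approach, with hypothetical hooks standing in for the real StateContext/heartbeat machinery (none of these names are the actual Ozone API):

```java
import java.util.List;

/**
 * Illustrative sketch only: queue a close action for every open container
 * on a full volume, then trigger an out-of-band heartbeat. The interface
 * below is a stand-in for the real datanode hooks, not the actual Ozone API.
 */
public class FullVolumeCloseActions {

  /** Hypothetical hooks; names are not the real Ozone methods. */
  public interface DatanodeHooks {
    List<Long> openContainersOn(String volumeId);   // IDs of open containers on the volume
    void queueCloseAction(long containerId);        // e.g. a CloseContainerAction for the SCM
    void triggerHeartbeat();                        // immediate, out-of-cycle heartbeat
  }

  private final DatanodeHooks hooks;

  public FullVolumeCloseActions(DatanodeHooks hooks) {
    this.hooks = hooks;
  }

  /** Called from the write path when the volume is detected as full. */
  public void onFullVolume(String volumeId) {
    for (long containerId : hooks.openContainersOn(volumeId)) {
      hooks.queueCloseAction(containerId);
    }
    hooks.triggerHeartbeat();
  }
}
```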

is OK for a container that is close to full, but not if the volume is close to full.

### Proposal
This is the proposal, explained via a diagram.
Contributor:

I don't think the diagram adequately explains the proposal. When talking about proposed proto updates, writing out code examples is helpful. The throttling implementation needs to be specified.

Contributor (Author):

I had thought of the throttling implementation and even tried it out in code while thinking of the design, but I didn't specify it in the doc. I've added it to the design now. There's also a pull request that implements part of this design, including throttling - #8492.

Please bear with me as I try to balance too much vs. too little content in my designs!

> When talking about proposed proto updates, writing out code examples is helpful.

Still thinking about what information we can send over the wire. I'll add it once I've decided. There are a couple of high-level ideas to prevent over-allocating blocks to a container/volume in the SCM:

  1. Track how much allocation is done at the SCM, similar to used space and committed space at the DN. Proposed by @sumitagrawl. (A rough sketch of this idea follows after this list.)
  2. Send reports of open containers every 30 seconds to SCM, which @ChenSammi proposed. Also handle these reports so that any containers with size >= max size are closed.
  3. Decide which containers should be closed in the DN, and send CloseContainerAction for all these containers in the heartbeat (which we briefly discussed in a previous comment).

We may need to do some of these, or a mix of all of these. As I think it through, I'll add more info.
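
For illustration only, a rough sketch of idea 1 (hypothetical names and structure; as the meeting summary further down explains, this approach was ultimately not pursued):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch of idea 1: SCM tracks how many bytes it has handed
 * out per container and stops allocating once the tracked total reaches
 * the container's max size. The known flaw: SCM cannot know how much of
 * each 256MB block the client actually writes, so the tracked total
 * over-estimates real usage.
 */
public class ScmAllocationTracker {

  private final long maxContainerSizeBytes;
  private final Map<Long, Long> allocatedBytes = new HashMap<>();

  public ScmAllocationTracker(long maxContainerSizeBytes) {
    this.maxContainerSizeBytes = maxContainerSizeBytes;
  }

  /** Returns true if a block of blockSize bytes may still go to the container. */
  public synchronized boolean tryAllocate(long containerId, long blockSize) {
    long current = allocatedBytes.getOrDefault(containerId, 0L);
    if (current + blockSize > maxContainerSizeBytes) {
      return false;   // treat the container as full; pick or open another one
    }
    allocatedBytes.put(containerId, current + blockSize);
    return true;
  }
}
```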

## Problem
When a Datanode volume is close to full, the SCM may not be immediately aware because storage reports are only sent
to it every thirty seconds. This can lead to the SCM allocating multiple blocks to containers on a full DN volume,
causing performance issues when the write fails. The proposal will partly solve this problem.
Contributor:

SCM checks the container size before allocating a block for that container. Currently the container size is only reported in the full container report, or when the container state changes from open to another state, so SCM is basically allocating blocks blindly. With these 30-second storage reports, I think we should consider reporting the open containers too, to help SCM better understand the open containers' state and avoid over-allocating blocks to a single container. @siddhantsangwan, what do you think?

Contributor (Author):

So you mean include a full container report for all containers in the DN, not just the ones on the full volume? We can use the method StateContext#getFullContainerReportDiscardPendingICR.

Contributor (@ChenSammi, May 19, 2025):

An open containers report, not a full container report. Only an open container will grow in size; other containers' sizes will only shrink, if they change at all. A timely open container size update will help SCM allocate blocks more precisely.

Contributor:

@siddhantsangwan
When a container is 90% full, we add a CloseContainerAction to be sent with the next heartbeat. This could be sent immediately using the immediate-heartbeat mechanism that is already present, with no need to wait for the next heartbeat. I think this may already happen when the ICR for it is sent; this can be verified.

@ChenSammi Since the DN already decides to stop block allocation when a container is 90% full and reports it, sending the open container list to SCM may not provide any further benefit, as the action has already been taken by the DN.

We need to see whether sending OpenContainer information provides any additional benefit. Sending open containers on every heartbeat can be tracked in a separate JIRA and implemented based on the benefits.


![full-volume-handling.png](../../static/full-volume-handling.png)

Throttling is required so the Datanode doesn't cause a heartbeat storm on detecting that some volumes are full in multiple write calls.
Contributor (@ChenSammi, May 19, 2025):

These are the activities that use DN volume space:
a. creating a new container (5GB reserved)
b. writing a new chunk
c. downloading and importing a container (10GB reserved)
d. container metadata RocksDB (no reservation)

IIRC, when SCM allocates a new pipeline, it checks whether the DN has enough space to hold the pipeline metainfo (Raft, 1GB) and one container (5GB). A volume-full report can help SCM become aware of this quickly. Maybe a full storage report, instead of a single volume-full report.

As for the proposal to carry the list of containers on this volume in the disk-full report: because open containers have already reserved space on the volume (the same goes for container replication imports), these open containers may still have room for new blocks even though the volume is full, as long as the total container size doesn't exceed 5GB. So closing all open containers on the full volume immediately might not be a necessary step, but closing open containers whose size exceeds 5GB is something we can do.
Also, when a disk is full, the DN is responsible for not allocating new containers on that volume and for not picking it as the target volume for container imports.

So overall my suggestion is:
a. carry open container state in the periodic storage report
b. when one disk is full, send a full storage report immediately, with open container state, to SCM out of cycle
c. make sure these kinds of reports are handled with priority in SCM. We may consider introducing a new port in SCM just for DN heartbeats with storage reports; currently all reports are sent to a single port.

Contributor (Author):

> IIRC, when SCM allocates a new pipeline, it checks whether the DN has enough space to hold the pipeline metainfo (Raft, 1GB) and one container (5GB). A volume-full report can help SCM become aware of this quickly. Maybe a full storage report, instead of a single volume-full report.

Agreed, I'm planning to send a full storage report containing info about all the volumes.

> As for the proposal to carry the list of containers on this volume in the disk-full report: because open containers have already reserved space on the volume (the same goes for container replication imports), these open containers may still have room for new blocks even though the volume is full, as long as the total container size doesn't exceed 5GB. So closing all open containers on the full volume immediately might not be a necessary step, but closing open containers whose size exceeds 5GB is something we can do.

Good point. I'm planning to use HddsDispatcher#isVolumeFull for checking whether a volume is full. This method ultimately checks whether `available - committed - min free space <= 0`. So if this method returns true, we only need to close containers whose size >= max size (5 GB); not all containers on this volume need to be closed.
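
As a minimal sketch of that check (parameter names are illustrative, not the actual HddsVolume/StorageLocationReport accessors):

```java
// Illustrative only: mirrors the "available - committed - min free <= 0"
// check described above; parameter names are hypothetical.
static boolean isVolumeFull(long availableBytes, long committedBytes,
    long minFreeSpaceBytes) {
  long usable = availableBytes - committedBytes - minFreeSpaceBytes;
  return usable <= 0;
}
```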

Throttling is required so the Datanode doesn't cause a heartbeat storm on detecting that some volumes are full in multiple write calls.

## Benefits
1. SCM will not include a Datanode in a new pipeline if all the volumes on it are full. The logic to do this already exists; we just update the volume stats in the SCM faster.
Contributor:

Does the SCM logic to allocate pipelines take the min free space setting into account? Say a volume reached 95GB of 100GB disk space, triggered a close for that container, sent a volume report, and SCM updated the volume stats for the node. Does SCM ensure that it does not allocate containers at 95GB usage, or does it wait till 100GB? I believe SCM should be aware of / have a similar or more liberal config.

Contributor (Author):

Yes, it takes min free space into account.

Comment on lines 101 to 105
A: Last, regular heartbeat
B: Volume 1 detected as full, heartbeat triggered
C: Volume 1 again detected as full, heartbeat throttled
D: Volume 2 detected as full, heartbeat throttled
E: Volume 3 detected as full, heartbeat triggered (30 seconds after B)
Member:

If vol 1 is full and has already been reported, but then vol 2 becomes full shortly after, the current global throttling would delay the notification about vol 2 to the SCM until the throttle interval passes. During this period, containers on vol 2 would continue to receive data from the SCM, which could lead to more write failures. I suggest throttling heartbeats independently for each volume, so that a full volume can be reported to the SCM as soon as it is detected, regardless of the status of other volumes.
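
A minimal sketch of that per-volume throttle (illustrative class and method names, not the actual Ozone code): each volume keeps its own last-trigger timestamp, so a newly full volume can be reported immediately even if another volume was reported moments earlier.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Illustrative per-volume throttle: a full-volume heartbeat may be
 * triggered for a volume at most once per interval, independently of
 * other volumes.
 */
public class PerVolumeHeartbeatThrottle {

  private final long minIntervalMillis;
  private final ConcurrentMap<String, Long> lastTriggerByVolume =
      new ConcurrentHashMap<>();

  public PerVolumeHeartbeatThrottle(long minIntervalMillis) {
    this.minIntervalMillis = minIntervalMillis;
  }

  /** Returns true if the caller should trigger a heartbeat for this volume now. */
  public boolean tryAcquire(String volumeId, long nowMillis) {
    Long last = lastTriggerByVolume.putIfAbsent(volumeId, nowMillis);
    if (last == null) {
      return true;   // first full-volume report for this volume
    }
    return nowMillis - last >= minIntervalMillis
        && lastTriggerByVolume.replace(volumeId, last, nowMillis);
  }
}
```

With this, volume 2 becoming full shortly after volume 1 would still trigger an immediate report for volume 2, which is exactly the case the global throttle delays.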

Contributor (Author):

I mostly agree with your suggestion. The only disadvantage I can think of is that the DN may end up sending multiple node reports at the same time in the worst case (around 20 reports, considering 20 volumes in the DN), where each report anyway contains data about all the volumes.

Member:

We could make sure that only one ‘volume-full’ heartbeat is sent at any given time—if the feature is supported, which may require refactoring DatanodeStateMachine.

Contributor (Author):

When triggering an immediate heartbeat, we're sending storage reports for all volumes. To keep complexity low, I think that's good enough for now. Let's see how it works in prod, and in a new JIRA we can do per-volume throttling if needed. What do you think?

Also tagging @ChenSammi who was discussing this here.

Member:

I'm fine with this~

Contributor (Author):

Cool!

Member:

Do you want to open a ticket for this?

@siddhantsangwan (Contributor, Author):

I discussed this design with several people in a meeting. Here are some of the important points:

  1. Trying to prevent over-allocation of blocks to a container is complicated. We could track how much space we've allocated to a container in the SCM - this is doable on the surface but won't actually work well. That's because SCM is asked for a block (256MB), but SCM doesn't know how much data a client will actually write to that block file. The client may only write 1MB, for example. So SCM could track that it has already allocated 5 GB to a container, and open another container for incoming requests, but the client may actually only write 1GB. This would lead to a lot of open containers when we have 10k requests/second. (Brought up by @sumitagrawl and @ChenSammi.)
  2. Sending open container reports regularly (every 30 seconds for example) can help a little bit, but won't solve the problem. We won't take this approach for now. (discussed with @ChenSammi and @nandakumar131).
  3. A problem that is out of scope for this design, but was discussed, is that Replication Manager + Placement Policy should not keep selecting a close-to-full Datanode as the target for a container. https://issues.apache.org/jira/browse/HDDS-12998, which is related to this, was also briefly mentioned. This point should be analysed in a different JIRA. (Brought up by @ashishkumar50.)

In summary, the only change this design proposes is to immediately trigger a heartbeat containing storage reports of all volumes when the isVolumeFull check returns true. This will promptly update the volumes' storage stats in the SCM. The SCM can take these into account when creating new pipelines - there's already logic in the SCM for that.
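
A minimal sketch of this flow, assuming a hypothetical Heartbeat hook (the real trigger lives in the datanode state machine and is not shown here): detect a full volume on the write path, throttle, and send an out-of-cycle heartbeat carrying storage reports for all volumes.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Illustrative sketch only; the Heartbeat interface is a stand-in for the
 * real datanode heartbeat hook, not the actual Ozone API.
 */
public class FullVolumeHeartbeatTrigger {

  /** Hypothetical hook for sending an immediate heartbeat. */
  public interface Heartbeat {
    void sendWithAllStorageReports();
  }

  private final Heartbeat heartbeat;
  private final long minIntervalMillis;
  private final AtomicLong lastTriggerMillis = new AtomicLong();

  public FullVolumeHeartbeatTrigger(Heartbeat heartbeat, long minIntervalMillis) {
    this.heartbeat = heartbeat;
    this.minIntervalMillis = minIntervalMillis;
  }

  /** Called from the write path when isVolumeFull(...) returns true. */
  public void onFullVolumeDetected() {
    long now = System.currentTimeMillis();
    long last = lastTriggerMillis.get();
    // Global throttle: at most one volume-full heartbeat per interval,
    // even if many write calls detect full volumes concurrently.
    if (now - last >= minIntervalMillis
        && lastTriggerMillis.compareAndSet(last, now)) {
      heartbeat.sendWithAllStorageReports();
    }
  }
}
```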

This is the only remaining open point for discussion - #8460 (comment).

@siddhantsangwan (Contributor, Author):

Updated the doc to reflect the latest discussion, please take a look.

siddhantsangwan marked this pull request as ready for review May 30, 2025 08:44
@siddhantsangwan (Contributor, Author):

@errose28 would you like to take another look?

Member (@peterxcli) left a comment:

LGTM!


## Benefits
1. SCM will not include a Datanode in a new pipeline if all the volumes on it are full. The logic to do this already exists; we just update the volume stats in the SCM faster.
2. Close to full volumes won't cause frequent write failures.
## Preventing over allocation of blocks in the SCM
Member (@peterxcli, Jun 2, 2025):

I guess you forgot to add an h2 "Alternatives" or "Rejected Approaches" here. And "Preventing over allocation of blocks in the SCM" should be h3, just like "Regularly sending open container reports."

Contributor (Author):

Good catch, fixed this in the latest commit.

Comment on lines 39 to 40
because the client will have to request for a different set of blocks. We will discuss how we tried to solve this,
but ultimately decided to not go ahead with the solution.
Contributor:

I'm confused on the status of this doc. What part of it maps to what we actually plan to implement?

Contributor (Author):

The design changed a lot. We've already merged the corresponding implementation - #8590, and I've updated this doc now.

This comment will tell you why and what changed - #8492 (comment)

This comment and the design doc itself will tell you what we implemented - #8590 (comment)

In short, we trigger a heartbeat immediately, including the close container action, with per-container throttling, and without sending storage reports.

@siddhantsangwan (Contributor, Author):

I've updated this design doc to reflect the implementation that's already been merged - #8590

Contributor (@sumitagrawl) left a comment:

@siddhantsangwan Please update the design doc.

@siddhantsangwan (Contributor, Author):

@sumitagrawl thanks for the review, I've addressed your comments.

Contributor (@sumitagrawl) left a comment:

LGTM

sumitagrawl merged commit f727051 into apache:master Jul 7, 2025 (14 checks passed)
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jul 31, 2025