# HDDS-12929. Datanode Should Immediately Trigger Container Close when Volume Full (#8460)
Author: Siddhant Sangwan, Sumit Agrawal
## Summary
On detecting a full Datanode volume during write, immediately trigger a heartbeat containing the latest storage report for all volumes.
## Problem
When a Datanode volume is close to full, the SCM may not be immediately aware of this because storage reports are only sent to it every minute (`HDDS_NODE_REPORT_INTERVAL_DEFAULT = "60s"`). We would like the SCM to know about this as soon as possible, so that it can make an informed decision when it checks the volumes on that Datanode to decide whether a new pipeline can include it (in `SCMCommonPlacementPolicy.hasEnoughSpace`).
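As an illustration of the kind of decision this enables, the sketch below shows a simplified placement-time space check. The types and method here are assumptions made for this document, not the actual `SCMCommonPlacementPolicy.hasEnoughSpace` implementation.

```java
// Hedged sketch of the idea behind the placement-time space check.
// This is a simplification, not the actual SCMCommonPlacementPolicy code.
import java.util.List;

final class PlacementSpaceCheck {

  // Minimal stand-in for a per-volume storage report (hypothetical type).
  record VolumeReport(long remainingBytes, long committedBytes) {}

  // A Datanode qualifies for a new pipeline only if at least one of its
  // volumes can still hold the required size; fresh storage reports keep
  // this decision accurate.
  static boolean hasEnoughSpace(List<VolumeReport> volumes, long requiredBytes) {
    return volumes.stream()
        .anyMatch(v -> v.remainingBytes() - v.committedBytes() >= requiredBytes);
  }
}
```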
Additionally, SCM only has stale information about the current size of a container, because container size is only updated when an Incremental Container Report (event based, for example when a container transitions from open to closing state) or a Full Container Report (`HDDS_CONTAINER_REPORT_INTERVAL_DEFAULT = "60m"`) is received. This can lead to the SCM over-allocating blocks to containers on a full DN volume. When the writes eventually fail, performance will drop because the client will have to request a different set of blocks. We will discuss how we tried to solve this, but ultimately decided not to go ahead with the solution.
### The definition of a full volume
A volume is considered full if an existing check in the Datanode returns true. The check accounts for available space, committed space, min free space and reserved space.
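As a rough illustration only, a check of that shape could look like the sketch below; the names and the exact formula are assumptions, not the actual Ozone method.

```java
// Hedged sketch: illustrates the kind of full-volume check described above.
// The names and the exact formula are assumptions, not the actual Ozone code.
final class VolumeFullCheck {

  static boolean isVolumeFull(long capacityBytes, long usedBytes,
      long committedBytes, long reservedBytes, long minFreeSpaceBytes) {
    // Space that is physically free on the volume.
    long available = capacityBytes - usedBytes;
    // Subtract space already promised to in-flight writes (committed), space
    // reserved for non-Ozone use, and the configured minimum free space.
    long usable = available - committedBytes - reservedBytes - minFreeSpaceBytes;
    return usable <= 0;
  }

  public static void main(String[] args) {
    // Example: 100 GB capacity, 95 GB used, 3 GB committed, 1 GB reserved,
    // 5 GB min free space -> the volume is considered full.
    long gb = 1024L * 1024 * 1024;
    System.out.println(isVolumeFull(100 * gb, 95 * gb, 3 * gb, gb, 5 * gb));
  }
}
```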
In the future (https://issues.apache.org/jira/browse/HDDS-12151) we plan to fail a write if it's going to exceed the min free space boundary in a volume.
## Non Goals
The proposed solution section describes the complete solution at a high level; HDDS-13045 will add the Datanode-side code for triggering a heartbeat on detecting a full volume, plus the throttling logic.
Failing the write if it exceeds the min free space boundary (https://issues.apache.org/jira/browse/HDDS-12151) is not discussed here.
## Proposed Solution

### What does the Datanode do currently when a volume is full?
In `HddsDispatcher`, on detecting that the volume being written to is full (as defined previously), we add a `CloseContainerAction` for that container:
```java
private void sendCloseContainerActionIfNeeded(Container container) {
  // We have to find a more efficient way to close a container.
  boolean isSpaceFull = isContainerFull(container) || isVolumeFull(container);
  boolean shouldClose = isSpaceFull || isContainerUnhealthy(container);
  if (shouldClose) {
    ContainerData containerData = container.getContainerData();
    ContainerAction.Reason reason =
        isSpaceFull ? ContainerAction.Reason.CONTAINER_FULL :
            ContainerAction.Reason.CONTAINER_UNHEALTHY;
    ContainerAction action = ContainerAction.newBuilder()
        .setContainerID(containerData.getContainerID())
        .setAction(ContainerAction.Action.CLOSE).setReason(reason).build();
    context.addContainerActionIfAbsent(action);
  }
}
```
This is sent to the SCM in the next heartbeat and makes the SCM close that container. This reaction time is acceptable when only the container is close to full, but not when the volume is close to full.
### Proposal for immediately triggering Datanode heartbeat
This is the proposal, explained via a diagram.


On detecting that a volume is full, the Datanode will get the latest storage reports for all volumes present on the node. It will add these to the heartbeat and immediately trigger it. If the container is also full, the `CloseContainerAction` will be sent in the same heartbeat.
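The following sketch illustrates that flow; all types and method names are hypothetical placeholders for this document, not the actual Ozone classes.

```java
// Hedged sketch of the proposed Datanode reaction to a full volume. All types
// and method names below are illustrative assumptions, not the Ozone API.
import java.util.List;

final class FullVolumeHandler {

  // Hypothetical stand-in for the storage report of a single volume.
  record VolumeStorageReport(String volumeId, long remainingBytes) {}

  interface StorageReportSource {
    List<VolumeStorageReport> latestReportsForAllVolumes();
  }

  interface HeartbeatSender {
    void queueCloseContainerAction(long containerId);
    void triggerImmediately(List<VolumeStorageReport> reports);
  }

  private final StorageReportSource reports;
  private final HeartbeatSender heartbeat;

  FullVolumeHandler(StorageReportSource reports, HeartbeatSender heartbeat) {
    this.reports = reports;
    this.heartbeat = heartbeat;
  }

  void onVolumeFull(long containerId, boolean containerAlsoFull) {
    if (containerAlsoFull) {
      // The CloseContainerAction rides in the same out-of-band heartbeat.
      heartbeat.queueCloseContainerAction(containerId);
    }
    // Attach fresh storage reports for all volumes and send right away,
    // instead of waiting for the regular heartbeat interval.
    heartbeat.triggerImmediately(reports.latestReportsForAllVolumes());
  }
}
```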
#### Throttling
Throttling is required so the Datanode doesn't cause a heartbeat storm when it detects full volumes across multiple write calls, e.g.:

```
E: Volume 3 detected as full, heartbeat triggered (30 seconds after B)
```

For code implementation, see https://github.com/apache/ozone/pull/8492.
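One possible shape for this throttling (the granularity and the interval below are assumptions, not the merged implementation) is a simple time-based throttle:

```java
// Hedged sketch of heartbeat throttling. The node-wide granularity and the
// interval value are assumptions, not the merged implementation.
import java.util.concurrent.atomic.AtomicLong;

final class HeartbeatTriggerThrottle {
  private final long minIntervalMillis;
  // 0 means "never triggered", so the first request always passes.
  private final AtomicLong lastTriggerMillis = new AtomicLong(0);

  HeartbeatTriggerThrottle(long minIntervalMillis) {
    this.minIntervalMillis = minIntervalMillis;
  }

  /** Returns true if an out-of-band heartbeat may be triggered now. */
  boolean tryAcquire(long nowMillis) {
    long previous = lastTriggerMillis.get();
    if (nowMillis - previous < minIntervalMillis) {
      return false; // throttled: a triggered heartbeat was sent recently
    }
    // CAS so that concurrent write threads trigger at most one heartbeat.
    return lastTriggerMillis.compareAndSet(previous, nowMillis);
  }
}
```

A write path would call `tryAcquire` before firing the immediate heartbeat and fall back to the regular heartbeat interval when it returns false.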
## Preventing over allocation of blocks in the SCM

Trying to prevent over-allocation of blocks to a container is complicated. We could track how much space we've allocated to a container in the SCM. This is doable on the surface, but it won't actually work well, because SCM is asked for a block (256 MB) without knowing how much data the client will actually write to that block file; the client may only write 1 MB, for example. So SCM could track that it has already allocated 5 GB to a container and open another container for incoming requests, while the client actually writes only 1 GB. This would lead to a lot of open containers when we have 10k requests/second.

At this point, we've decided not to do anything about this.
## Alternatives

### Regularly sending open container reports
Sending open container reports regularly (every 30 seconds, for example) can help a little, but it won't solve the problem. We won't take this approach for now.
## Benefits
SCM will not include a Datanode in a new pipeline if all the volumes on it are full. The logic to do this already exists; we just update the volume stats in the SCM faster.
## Implementation Plan
1. HDDS-13045: Code for including node report, triggering heartbeat, throttling.
2. HDDS-12151: Fail a write call if it exceeds the min free space boundary (not discussed in this doc).
---

Review comment: I'm confused on the status of this doc. What part of it maps to what we actually plan to implement?
Reply: The design changed a lot. We've already merged the corresponding implementation (#8590), and I've updated this doc now. This comment will tell you why and what changed: #8492 (comment). This comment and the design doc itself will tell you what we implemented: #8590 (comment). In short, we trigger the heartbeat immediately, including the close container action, with per-container throttling, and without sending storage reports.