
Conversation

@rich7420
Contributor

What changes were proposed in this pull request?

This PR adds a configurable multiplier (ozone.scm.container.space.requirement.multiplier, default 2.0) that increases the space check from 5GB to 10GB during container allocation. This provides a buffer so that multiple allocations don't fail when there's barely enough space.
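The check described above can be sketched as follows. This is a minimal illustration, not the actual Ozone code: the class and method names (SpaceCheckSketch, hasEnoughSpace) are made up for this example.

```java
// Minimal sketch of a multiplied space check; names are illustrative,
// not the actual Ozone API.
public class SpaceCheckSketch {

  static boolean hasEnoughSpace(long freeBytes, long containerSizeBytes, double multiplier) {
    long required = (long) (containerSizeBytes * multiplier);
    return freeBytes >= required;
  }

  public static void main(String[] args) {
    long gb = 1024L * 1024 * 1024;
    // With the default multiplier of 2.0 the check requires 10GB,
    // so a datanode with only 6GB free is rejected up front.
    System.out.println(hasEnoughSpace(6 * gb, 5 * gb, 2.0));  // false
    System.out.println(hasEnoughSpace(12 * gb, 5 * gb, 2.0)); // true
  }
}
```

The point of the buffer is that a node with just over one container's worth of free space no longer passes the check, so fewer concurrent allocations land on a nearly full node.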

What is the link to the Apache JIRA

HDDS-13922

How was this patch tested?

All tests passed in:
https://github.com/rich7420/ozone/actions/runs/19655213061

Member

@peterxcli peterxcli left a comment


Thanks @rich7420 for this patch! Left some comments.

cat target/${{ inputs.script }}/summary.md >> $GITHUB_STEP_SUMMARY
fi
hadoop-ozone/dev-support/checks/_summary.sh target/${{ inputs.script }}/summary.txt
hadoop-ozone/dev-support/checks/_summary.sh target/${{ inputs.script }}/summary.txt || true
Member

Suggested change
hadoop-ozone/dev-support/checks/_summary.sh target/${{ inputs.script }}/summary.txt || true
hadoop-ozone/dev-support/checks/_summary.sh target/${{ inputs.script }}/summary.txt

DEVELOCITY_ACCESS_KEY: ${{ secrets.DEVELOCITY_ACCESS_KEY }}
- name: Summary of failures
run: hadoop-ozone/dev-support/checks/_summary.sh target/unit/summary.txt
run: hadoop-ozone/dev-support/checks/_summary.sh target/unit/summary.txt || true
Member

Suggested change
run: hadoop-ozone/dev-support/checks/_summary.sh target/unit/summary.txt || true
run: hadoop-ozone/dev-support/checks/_summary.sh target/unit/summary.txt

continue-on-error: true
- name: Summary of failures
run: hadoop-ozone/dev-support/checks/_summary.sh target/${{ github.job }}/summary.txt
run: hadoop-ozone/dev-support/checks/_summary.sh target/${{ github.job }}/summary.txt || true
Member

Suggested change
run: hadoop-ozone/dev-support/checks/_summary.sh target/${{ github.job }}/summary.txt || true
run: hadoop-ozone/dev-support/checks/_summary.sh target/${{ github.job }}/summary.txt

"ozone.scm.container.size";
public static final String OZONE_SCM_CONTAINER_SIZE_DEFAULT = "5GB";

public static final String OZONE_SCM_CONTAINER_SPACE_REQUIREMENT_MULTIPLIER =
Member

I’m thinking this patch only needs to add a config to set the required data size (dataSizeRequired), which we can then read from ReplicationManagerUtil here: https://github.com/peterxcli/ozone/blob/7868a862ff616154e4bd9ab52dada929e15042ec/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManagerUtil.java#L93-L94.
@siddhantsangwan does this approach look correct?

Contributor Author

Thanks @peterxcli! I've simplified the approach: removed the new config entirely and directly used HddsServerUtil.requiredReplicationSpace(), which already provides the 2x multiplier.
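For reference, a hedged sketch of what a 2x replication-space helper amounts to. The real HddsServerUtil.requiredReplicationSpace() may differ in signature and details; this stand-in is purely illustrative.

```java
public class RequiredSpaceSketch {

  // Illustrative stand-in for HddsServerUtil.requiredReplicationSpace():
  // reserve room for the container being written plus one extra container's
  // worth of space as a buffer (the 2x factor mentioned above).
  static long requiredReplicationSpace(long maxContainerSizeBytes) {
    return 2 * maxContainerSizeBytes;
  }

  public static void main(String[] args) {
    long fiveGb = 5L * 1024 * 1024 * 1024;
    // 5GB container size -> 10GB required free space.
    System.out.println(requiredReplicationSpace(fiveGb));
  }
}
```

Reusing an existing helper keeps the buffer in one place instead of introducing a second, overlapping configuration knob.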

</property>
<property>
<name>ozone.scm.container.space.requirement.multiplier</name>
<value>2.0</value>


It would solve the problem with two concurrent allocations, which would both pass. But with increased concurrency we could expect the same failure.

Contributor

I don't think we even need to add one more config here; instead, this value can be 5x or 10x of the container size, since the degree of concurrency is unpredictable here. Even if concurrency is very high on the same DN at the same time, the client always has a mechanism to retry on another pipeline.

@yandrey321

I'd rather suggest graceful handling of the out-of-space event from the DN during container creation, and if it fails, using one of the already opened containers.

The other approach would be serializing container creation if the DN has less than 5-10% free space.

@peterxcli
Member

peterxcli commented Nov 26, 2025

Correct me if I'm wrong.
IMO, the best way to resolve this from the root cause might be a similar approach to: https://issues.apache.org/jira/browse/HDDS-12810 (#8360)

But I haven't looked into the related code, so I'm not sure whether we can do the same thing at the SCM side.
cc @ChenSammi @siddhantsangwan @sumitagrawl

@ashishkumar50
Contributor

Correct me if I'm wrong. IMO, the best way to resolve this from the root cause might be a similar approach to: https://issues.apache.org/jira/browse/HDDS-12810 (#8360)

But I haven't looked into the related code, so I'm not sure whether we can do the same thing at the SCM side.

On the DN we have committed bytes, so it is a simple and easy way to handle this scenario. But in SCM we don't have committed bytes to control it the same way the DN does, and it would be too complex to add committed bytes to SCM.

@rich7420
Contributor Author

I'd rather suggest graceful handling of the out-of-space event from the DN during container creation, and if it fails, using one of the already opened containers.

The other approach would be serializing container creation if the DN has less than 5-10% free space.

You're right that 2x only provides a buffer for limited concurrency. However, implementing the more robust solutions you suggested (graceful out-of-space handling, serialized allocation) would require significant architectural changes. I think the current approach using the existing requiredReplicationSpace() method (2x) provides an immediate improvement for common cases while keeping the code simple. We can consider more comprehensive solutions in future work if this proves insufficient. Thanks for the suggestion, @yandrey321!

@ChenSammi
Contributor

ChenSammi commented Dec 1, 2025

@rich7420, thanks for working on this task.

IIUC, SCM doesn't track the new containers allocated on a specific DN, so in the case of concurrent container allocation requests, the space check in SCM will pass even though there is not enough space on the DN to serve all the new containers. (DN-side container allocation space reservation is now serialized, see https://issues.apache.org/jira/browse/HDDS-12810, so the DN will fail the requests when there is not enough space.) Increasing the available-space requirement above 5GB in SCM therefore cannot guarantee to solve the problem; since concurrency is not a fixed quantity, it's hard to decide which number is enough in SCM.
It's not recommended to introduce code complexity when it cannot guarantee to solve the problem.
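The race described above can be sketched as follows. The numbers and names are illustrative, not the Ozone code: every concurrent request checks against the same free-space snapshot, so all of them pass even though the datanode cannot serve them all.

```java
public class ConcurrentCheckSketch {

  public static void main(String[] args) {
    long gb = 1024L * 1024 * 1024;
    long freeOnDn = 12 * gb;   // free-space snapshot SCM checks against
    long required = 10 * gb;   // e.g. 5GB container x 2.0 multiplier

    // Three concurrent allocation requests all see the same snapshot,
    // so each one passes the SCM-side check independently.
    int passed = 0;
    for (int i = 0; i < 3; i++) {
      if (freeOnDn >= required) {
        passed++;
      }
    }

    long demanded = passed * 5 * gb; // 15GB of containers on a 12GB node
    System.out.println("checks passed: " + passed);
    System.out.println("overcommitted: " + (demanded > freeOnDn));
  }
}
```

No fixed multiplier removes this race; it only changes how many concurrent requests fit under the threshold, which is the point being made here.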

Contributor

@siddhantsangwan siddhantsangwan left a comment


I think this will not completely fix the problem, but it can improve the situation. If everyone agrees to go ahead with this change, we can make it configurable and have at least 5x, that is 25 GB, as the default. This will need to be checked in both places: when creating a pipeline and when creating a new container.

@siddhantsangwan
Contributor

Do we have any alternative ways to solve this problem?

@siddhantsangwan
Contributor

I'd rather suggest graceful handling of the out-of-space event from the DN during container creation, and if it fails, using one of the already opened containers.

graceful handling of Out of space

I'm not sure what the client retry behaviour is. We should have some retry mechanism to pick a different pipeline, but write performance is already degraded at that point, so we need to make better decisions at the SCM level about which pipeline is given to a client in the first place.

I think we need to test the correctness and performance of CapacityPipelineChoosePolicy and make it the default to reduce the probability of a full pipeline being chosen. CapacityPipelineChoosePolicy in addition to this fix should improve the situation.

Comment on lines 250 to 252
hadoop-ozone/dev-support/checks/_summary.sh target/${{ inputs.script }}/summary.txt
if [[ -f "hadoop-ozone/dev-support/checks/_summary.sh" ]]; then
hadoop-ozone/dev-support/checks/_summary.sh target/${{ inputs.script }}/summary.txt
fi
Contributor

These look unrelated. Why are CI workflow files being changed?

Contributor Author

Thanks for asking @adoroszlai! This PR changed this part because I got an error when the _summary.sh file was missing.

Contributor Author

I can remove it.

Contributor

I got an error when the _summary.sh file was missing.

_summary.sh does exist in the source repo. Please check for the root cause instead of masking it.

summary.md is checked because it's generated by some CI checks, but not all.

Copilot AI review requested due to automatic review settings December 6, 2025 02:36
@rich7420
Contributor Author

rich7420 commented Dec 6, 2025

Thanks for all the valuable feedback.
I agree that while this is a mitigation rather than a strict fix for the race condition, increasing the buffer makes it much more robust for practical use cases.
Updates in the latest commit:
I changed the default ozone.scm.container.space.requirement.multiplier from 2.0 to 5.0 as suggested, to provide a safer buffer for concurrent allocations, and I reverted the unrelated changes in .github/workflows to keep the PR clean.
Please take a look. Thanks!

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a configurable multiplier for space requirements during container allocation to prevent concurrent allocation failures. The implementation adds ozone.scm.container.space.requirement.multiplier (default 5.0) to increase the space check threshold, though there's a discrepancy with the PR description which states the default is 2.0.

Key Changes:

  • Added configurable space requirement multiplier with default value of 5.0
  • Updated container allocation logic to use multiplied space requirement (25GB with default 5GB container size)
  • Modified pipeline providers (Ratis and EC) to use the new space calculation

Reviewed changes

Copilot reviewed 8 out of 14 changed files in this pull request and generated 4 comments.

Show a summary per file

  • hadoop-hdds/common/src/main/resources/ozone-default.xml: Adds configuration property for space requirement multiplier with default value 5.0
  • hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/ScmConfigKeys.java: Defines the new configuration key and default constant (5.0)
  • hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/HddsServerUtil.java: Adds overloaded method to calculate required space using the configurable multiplier
  • hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerManagerImpl.java: Updates container allocation to use the new space requirement calculation
  • hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/RatisPipelineProvider.java: Updates Ratis pipeline creation to use the multiplied space requirement
  • hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/ECPipelineProvider.java: Updates EC pipeline creation to use the multiplied space requirement
  • hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/TestContainerManagerImpl.java: Adds test to verify the space requirement calculation is used correctly
  • hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/pipeline/TestECPipelineProvider.java: Updates test to use the new space requirement calculation
  • hadoop-ozone/dist/src/main/k8s/examples/*/config-configmap.yaml: Adds container size configuration (1GB) to K8s deployment examples



public static final String OZONE_SCM_CONTAINER_SPACE_REQUIREMENT_MULTIPLIER =
"ozone.scm.container.space.requirement.multiplier";
public static final double OZONE_SCM_CONTAINER_SPACE_REQUIREMENT_MULTIPLIER_DEFAULT = 5.0;
Copilot AI Dec 6, 2025

The PR description states the default multiplier is 2.0, but the implementation sets it to 5.0. This inconsistency should be resolved. Either update the PR description to reflect the actual default of 5.0, or change the default value in the code to 2.0 to match the description.

Suggested change
public static final double OZONE_SCM_CONTAINER_SPACE_REQUIREMENT_MULTIPLIER_DEFAULT = 5.0;
public static final double OZONE_SCM_CONTAINER_SPACE_REQUIREMENT_MULTIPLIER_DEFAULT = 2.0;

Comment on lines +1042 to +1052
<description>
Multiplier for container space requirement when checking if a datanode
has enough space for container allocation. The required space is calculated
as container size multiplied by this value. This prevents concurrent clients
from all passing the space check when there's only enough space for one
container. For example, with default container size of 5GB and multiplier
of 5.0, the system will require 25GB of available space before allocating
a new container. This ensures that if only 6GB is remaining, the check will
fail, preventing multiple clients from attempting to create containers
concurrently when there's only space for one.
</description>
Copilot AI Dec 6, 2025

The example in the documentation is incorrect. With a default container size of 5GB and a multiplier of 5.0, the required space would be 25GB (5GB * 5.0), not 10GB as mentioned in the PR description. The documentation correctly states 25GB, but this contradicts the PR description which mentions increasing from 5GB to 10GB (which would require a multiplier of 2.0).

</property>
<property>
<name>ozone.scm.container.space.requirement.multiplier</name>
<value>5.0</value>
Copilot AI Dec 6, 2025

The PR description states the default multiplier is 2.0, but the implementation sets it to 5.0. This inconsistency should be resolved. Either update the PR description to reflect the actual default of 5.0, or change the default value in the code to 2.0 to match the description.

@adoroszlai adoroszlai marked this pull request as draft December 6, 2025 06:55
@adoroszlai
Contributor

Changed to draft due to failures in acceptance tests: https://github.com/rich7420/ozone/actions/runs/19981844525

CI runs with limited space and limited number of nodes. We should either set multiplier=1 or reduce container size.

@siddhantsangwan
Contributor

I had a separate discussion with @ChenSammi and @ashishkumar50. It was decided it's better to pause this for now, test CapacityPipelineChoosePolicy and then decide if we should go ahead with this. The reason is that we already have a lot of configurations in this area and it's complicated. Adding another configuration or buffer becomes hard to explain to ozone users.

I'm working on testing and exploring CapacityPipelineChoosePolicy, will create a jira soon. Based on testing we can decide whether to go ahead with this pull request (or a better approach). @rich7420 what do you think?

@rich7420
Contributor Author

Sounds good to me, thanks for driving this. Let's close this PR and proceed with testing CapacityPipelineChoosePolicy. I'm happy to help review the new JIRA/patch or adjust this PR based on the results. Thanks @siddhantsangwan!

@rich7420 rich7420 closed this Dec 19, 2025