HDDS-12468. Check for space availability for all dns while container creation in pipeline #8663
Conversation
Pipeline newPipeline = pipelineManager.createPipeline(repConfig,
    excludedNodes, Collections.emptyList());
// the returned ContainerInfo should not be null (due to not enough space in the Datanodes specifically) because
Could newPipeline be null when DN is close to full?
Good point. I checked, pipeline will not be null. There's a checkPipeline(pipeline) method that checks for null and throws an exception instead.
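For illustration, a minimal sketch of such a fail-fast guard (the class name, method signature, and exception type here are hypothetical; the actual Ozone checkPipeline method may differ):

```java
// Hypothetical sketch of a checkPipeline-style guard: instead of letting a
// null pipeline propagate to callers, fail fast with an exception.
final class PipelineGuard {

    static <T> T checkPipeline(T pipeline) {
        if (pipeline == null) {
            // Throw rather than returning null, so callers never see a null pipeline.
            throw new IllegalStateException("Pipeline creation failed: pipeline is null");
        }
        return pipeline;
    }

    public static void main(String[] args) {
        // A non-null pipeline passes through unchanged.
        System.out.println(checkPipeline("pipeline-1"));
    }
}
```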
 * @return false if all the volumes on any Datanode in the pipeline have space less than or equal to the specified
 *         containerSize, otherwise true
 */
boolean pipelineHasEnoughSpaceForNewContainer(Pipeline pipeline, long containerSize);
pipelineHasEnoughSpaceForNewContainer -> HasEnoughSpace
allocateContainer(pipeline, owner);
containerIDs = getContainersForOwner(pipeline, owner);
} else {
  LOG.debug("Cannot allocate a new container because pipeline {} does not have the required space {}.",
Please add the if (LOG.isDebugEnabled) check.
AFAIK the LOG.isDebugEnabled check isn't needed when using slf4j with "{}" style logging. The log string is evaluated after the debug method itself calls isDebugEnabled().
Yes, if the log parameters are just simple references being passed, the isDebugEnabled() check is not needed, as the logger does it internally.
Good to know that slf4j with "{}" will delay the string construction, thanks!
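A self-contained sketch (a toy class, not the real slf4j API) of why the guard is redundant with "{}"-style logging: the level check runs inside debug() before any message formatting happens, so a disabled logger only pays the cost of passing references.

```java
// Toy logger mimicking slf4j's parameterized logging. The message template is
// only expanded after the internal level check, so callers don't need their
// own isDebugEnabled() guard when the arguments are simple references.
final class LazyLogger {

    private final boolean debugEnabled;
    int formatCalls;  // counts how many times formatting actually ran

    LazyLogger(boolean debugEnabled) {
        this.debugEnabled = debugEnabled;
    }

    void debug(String template, Object... args) {
        if (!debugEnabled) {
            return;  // arguments were passed, but never formatted
        }
        formatCalls++;
        String msg = template;
        for (Object arg : args) {
            int idx = msg.indexOf("{}");
            if (idx < 0) {
                break;
            }
            msg = msg.substring(0, idx) + arg + msg.substring(idx + 2);
        }
        System.out.println(msg);
    }

    public static void main(String[] args) {
        LazyLogger log = new LazyLogger(true);
        log.debug("Cannot allocate a new container because pipeline {} does not have the required space {}.",
            "pipeline-1", 5L);
    }
}
```

Note the guard can still be worthwhile when *computing* an argument is expensive (e.g. building a large string eagerly at the call site); for plain references it is unnecessary.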
  * @param excludedContainerIDS - containerIds to be excluded.
- * @return ContainerInfo for the matching container.
+ * @return ContainerInfo for the matching container, or null if a container could not be found and could not be
+ *         allocated
Use @Nullable, which is very straightforward.
@ChenSammi thanks for the review. Addressed your comments in the latest commit.
if (containerIDs.size() < getOpenContainerCountPerPipeline(pipeline)) {
  allocateContainer(pipeline, owner);
  containerIDs = getContainersForOwner(pipeline, owner);
if (pipelineManager.hasEnoughSpace(pipeline, containerSize)) {
Can we move this check to allocateContainer method?
Moving the check to allocateContainer causes too many unrelated test failures because it is being used in unit tests which don't care about pipeline state. I could move it but we'll need to fix those other tests then.
Can fix them by doing something like this, what do you think?
diff --git a/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerManagerImpl.java b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerManagerImpl.java
index 65b06bfd42..2e30b72888 100644
--- a/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerManagerImpl.java
+++ b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerManagerImpl.java
@@ -243,6 +243,10 @@ private ContainerInfo createContainer(Pipeline pipeline, String owner)
private ContainerInfo allocateContainer(final Pipeline pipeline,
final String owner)
throws IOException {
+ if (!pipelineManager.hasEnoughSpace(pipeline, containerSize)) {
+ return null;
+ }
+
final long uniqueId = sequenceIdGen.getNextId(CONTAINER_ID);
Preconditions.checkState(uniqueId > 0,
"Cannot allocate container, negative container id" +
diff --git a/hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/TestContainerManagerImpl.java b/hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/TestContainerManagerImpl.java
index baa09cd854..df74709f93 100644
--- a/hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/TestContainerManagerImpl.java
+++ b/hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/TestContainerManagerImpl.java
@@ -84,7 +84,7 @@ public class TestContainerManagerImpl {
private SequenceIdGenerator sequenceIdGen;
private NodeManager nodeManager;
private ContainerReplicaPendingOps pendingOpsMock;
- private PipelineManager pipelineManager;
+ private MockPipelineManager pipelineManager;
@BeforeAll
static void init() {
@@ -142,6 +142,7 @@ public void testGetMatchingContainerReturnsNullWhenNotEnoughSpaceInDatanodes() t
Pipeline pipeline = pipelineManager.getPipelines().iterator().next();
// MockPipelineManager#hasEnoughSpace always returns false
// the pipeline has no existing containers, so a new container gets allocated in getMatchingContainer
+ pipelineManager.setHasEnoughSpaceBehavior(false);
ContainerInfo container = containerManager
.getMatchingContainer(sizeRequired, "test", pipeline, Collections.emptySet());
assertNull(container);
@@ -159,9 +160,9 @@ public void testGetMatchingContainerReturnsContainerWhenEnoughSpaceInDatanodes()
long sizeRequired = 256 * 1024 * 1024; // 256 MB
// create a spy to mock hasEnoughSpace to always return true
- PipelineManager spyPipelineManager = spy(pipelineManager);
- doReturn(true).when(spyPipelineManager)
- .hasEnoughSpace(any(Pipeline.class), anyLong());
+ PipelineManager spyPipelineManager = pipelineManager;
+// doReturn(true).when(spyPipelineManager)
+// .hasEnoughSpace(any(Pipeline.class), anyLong());
// create a new ContainerManager using the spy
File tempDir = new File(testDir, "tempDir");
diff --git a/hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/pipeline/MockPipelineManager.java b/hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/pipeline/MockPipelineManager.java
index 78f1865d2b..3274de3d8a 100644
--- a/hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/pipeline/MockPipelineManager.java
+++ b/hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/pipeline/MockPipelineManager.java
@@ -48,6 +48,7 @@
public class MockPipelineManager implements PipelineManager {
private final PipelineStateManager stateManager;
+ private boolean hasEnoughSpaceBehavior = true;
public MockPipelineManager(DBStore dbStore, SCMHAManager scmhaManager,
NodeManager nodeManager) throws IOException {
@@ -352,6 +353,10 @@ public boolean isPipelineCreationFrozen() {
@Override
public boolean hasEnoughSpace(Pipeline pipeline, long containerSize) {
- return false;
+ return hasEnoughSpaceBehavior;
+ }
+
+ public void setHasEnoughSpaceBehavior(boolean hasEnoughSpace) {
+ hasEnoughSpaceBehavior = hasEnoughSpace;
}
}
There's no handling for null in the SCMClientProtocolServer code flow that calls allocateContainer, so there can be a null pointer exception if allocateContainer returns null. I think this code path is being used for benchmarks. Probably better not to make changes that impact this path in this pull request.
@siddhantsangwan, let's do this in a separate PR.
Created a follow-up jira - https://issues.apache.org/jira/browse/HDDS-13338
# Conflicts:
#	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerManager.java
@siddhantsangwan, there is an integration test failure, can you take a look at it?
It's unrelated and flaky. Passed in my fork - https://github.com/siddhantsangwan/ozone/actions/runs/15845888336
I re-triggered the failing test. |
sumitagrawl left a comment
LGTM, one minor comment needs a fix.
CI has passed, merging on the basis of Sumit's LGTM.
…e container creation in pipeline (apache#8663)

(cherry picked from commit 7a237fe)

Conflicts:
SCMNodeManager#getDatanodeInfo: This version does not have the method `@Nullable DatanodeDetails getNode(@Nullable DatanodeID id)`, so I directly used nodeStateManager#getNode to get the `DatanodeInfo` object for a `DatanodeDetails`.

Other conflicts:
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerManager.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/SCMNodeManager.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/HddsTestUtils.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/MockNodeManager.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/SimpleMockNodeManager.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/TestContainerManagerImpl.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/pipeline/TestPipelineManagerImpl.java
What changes were proposed in this pull request?
When finding a container for a block, the SCM needs to allocate a new container if the existing containers don't match requirements. Currently, the SCM just creates a new container without checking if the Datanodes in that particular pipeline have enough disk space (5 GB) for the new container.
This pull request adds a check for both Ratis and EC flows. The check happens at a single place in ContainerManagerImpl. If the SCM is unable to allocate a new container, a different pipeline needs to be tried; code to handle this already exists. The logic for the check itself already exists in SCMCommonPlacementPolicy#hasEnoughSpace and I reuse it here. This existing logic has interesting behaviour for one particular case: it returns false if exactly the required amount of space is available. For example, if the required space is 5 GB and one datanode has exactly 5 GB free, the method will return false. It seems like a defensive way of handling this corner case of disks being close to full.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-12468
How was this patch tested?
Added unit tests.
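The exactly-equal corner case mentioned in the description can be sketched as follows (the class, method, and parameter names are assumptions for illustration, not the actual SCMCommonPlacementPolicy code):

```java
// Sketch of the strict comparison: a datanode only counts as having enough
// space if some volume's free space strictly exceeds the required container
// size, so a volume with exactly the required space is rejected.
final class SpaceCheck {

    static boolean datanodeHasEnoughSpace(long[] volumeFreeBytes, long containerSize) {
        for (long free : volumeFreeBytes) {
            if (free > containerSize) {  // strictly greater, not >=
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long fiveGb = 5L * 1024 * 1024 * 1024;
        // A volume with exactly 5 GB free is treated as NOT having enough space.
        System.out.println(datanodeHasEnoughSpace(new long[] {fiveGb}, fiveGb));     // false
        System.out.println(datanodeHasEnoughSpace(new long[] {fiveGb + 1}, fiveGb)); // true
    }
}
```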
Ready for review, draft while waiting for CI - https://github.com/siddhantsangwan/ozone/actions/runs/15757908756/job/44418185632