
Conversation

@guihecheng
Contributor

What changes were proposed in this pull request?

Pipeline placement policy should filter out those datanodes that do not even have enough space for a single container, otherwise even if the pipeline creation succeeds, a datanode may quickly report that it fails to create a container and there will be successive failures because we will choose that node again and again.
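For illustration only, a minimal sketch of the kind of check described above; the class and method names are hypothetical, not the actual PipelinePlacementPolicy code:

```java
/**
 * Hypothetical helper illustrating the proposed filter: a datanode is a
 * viable pipeline member only if it can still hold at least one container.
 */
final class ContainerSpaceFilter {
  static boolean hasSpaceForOneContainer(long remainingBytes, long containerSizeBytes) {
    // Without such a check, pipeline creation may succeed but the first
    // container allocation on the node fails, and SCM can keep picking
    // the same full node again and again.
    return remainingBytes >= containerSizeBytes;
  }
}
```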

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-5252

How was this patch tested?

A new unit test.
A manual test.

@guihecheng
Contributor Author

@bshashikant @ChenSammi please help review this one, thanks~

@GlenGeng-awx
Contributor

Thanks @guihecheng for the fix!
I agree with the necessity of the solution. My concern is the sizeRequired, which is 5GB by default; that does not seem quite viable for a production cluster.

How about adjusting it to be based on the usage percentage of the datanode? For example, a datanode with usage greater than 80% is not a good candidate for a new pipeline.

@guihecheng
Contributor Author

@GlenGeng Thanks for your comments. I understand your concern, but I think the case here is a bit different.
A datanode with usage > 80% is indeed not a good candidate and should have a lower priority when nodes are chosen, but it should still have a chance to be chosen. It is possible that all datanodes in the cluster have usage > 80%; if we filter them out completely, we no longer have a writable cluster even though ~20% of the space is actually left.

Here we try to solve the problem that if there is not even enough space for a single container, that datanode should not be chosen at all.
Maybe we could implement a new placement policy for pipelines with a priority defined by the available space of the datanodes, to solve the problem you mentioned? A rough sketch of that idea is below.
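A minimal sketch of the priority idea mentioned above, using hypothetical types rather than the real placement policy API: nodes without room for a single container are filtered out (the hard requirement from this PR), while the remaining nodes are only sorted so that emptier ones are preferred.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class SpaceAwareSelection {

  /** Hypothetical view of a datanode's space; not an Ozone class. */
  static final class NodeSpace {
    final String host;
    final long capacity;
    final long remaining;

    NodeSpace(String host, long capacity, long remaining) {
      this.host = host;
      this.capacity = capacity;
      this.remaining = remaining;
    }

    double usedRatio() {
      return capacity == 0 ? 1.0 : 1.0 - (double) remaining / capacity;
    }
  }

  /**
   * Hard filter: drop nodes that cannot host even one container, then
   * soft-prioritize the rest by usage, so nodes above 80% stay eligible
   * but are chosen last.
   */
  static List<NodeSpace> orderCandidates(List<NodeSpace> nodes, long containerSize) {
    return nodes.stream()
        .filter(n -> n.remaining >= containerSize)                 // hard requirement
        .sorted(Comparator.comparingDouble(NodeSpace::usedRatio))  // soft preference
        .collect(Collectors.toList());
  }
}
```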

@GlenGeng-awx
Contributor

Thanks @guihecheng for the clarification.

One concern about the solution based on sizeRequired: assume there are 3 DN docker containers for a test, each with 1GB of storage space. Will SCM be able to create a pipeline from them?

@guihecheng
Contributor Author

@GlenGeng Oh, I think not. After this change we will have to set a proper container size for test deployments.

Contributor

@bshashikant bshashikant left a comment


I think the solution is reasonable. Any container allocation will fail on a datanode anyway if it does not have space equal to the container size. It is better not to choose such nodes while allocating a new container.
For docker tests, I think the container size is already set to a lower threshold, but we can double-check on that.

@guihecheng
Contributor Author

@bshashikant Ah, just now I had a talk with @GlenGeng, and he suggested that we could do better here.

We could turn the sizeRequired into a new config item such as 'pipeline.reserved.space', maybe defaulting to 0.
With this new item, we could do the following things (a rough sketch of the idea follows after this list):

  1. Use the default; then we keep the same behavior as before, so users/developers who just want to run small tests or try out Ozone will not be annoyed by a warning saying that they have no space for pipeline creation.
  2. Configure it to some value for space reservation (preferably > container size); then we could solve the problem here too.
    Moreover, when this reserved space is larger than the container size, say 20% of the volume capacity, we can filter out the nodes whose volumes all have usage > 80%, which may not be good choices for pipeline allocation in production, as Glen mentioned.
    It seems we could get overall control over the space usage of the cluster with this new config item.
    And this should not conflict with the datanode config 'du.reserved', since that one's use case is multiple consumers sharing a single disk.
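A minimal sketch of the idea, assuming a hypothetical 'pipeline.reserved.space' setting and made-up method names (this is not existing Ozone configuration or code): with the proposed default of 0 every node passes and the current behavior is preserved, while a non-zero value reserves space on candidate datanodes.

```java
/**
 * Hypothetical illustration of the proposed 'pipeline.reserved.space' idea;
 * the key name and method name are invented for this sketch.
 */
final class PipelineSpaceCheck {

  /** With the proposed default of 0, every node passes and behavior is unchanged. */
  static boolean eligibleForNewPipeline(long remainingBytes, long reservedSpaceBytes) {
    return remainingBytes >= reservedSpaceBytes;
  }
}
```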

@bshashikant
Contributor

> @bshashikant Ah, just now I had a talk with @GlenGeng, and he suggested that we could do better here.
>
> We could turn the sizeRequired into a new config item such as 'pipeline.reserved.space', maybe defaulting to 0.
> With this new item, we could do the following things:
>
> 1. Use the default; then we keep the same behavior as before, so users/developers who just want to run small tests or try out Ozone will not be annoyed by a warning saying that they have no space for pipeline creation.
> 2. Configure it to some value for space reservation (preferably > container size); then we could solve the problem here too. Moreover, when this reserved space is larger than the container size, say 20% of the volume capacity, we can filter out the nodes whose volumes all have usage > 80%, which may not be good choices for pipeline allocation in production, as Glen mentioned. It seems we could get overall control over the space usage of the cluster with this new config item. And this should not conflict with the datanode config 'du.reserved', since that one's use case is multiple consumers sharing a single disk.

I guess we need to be more precise here. A container creation won't succeed on a datanode unless it has remaining space >= container size, since there is a hard check there (see RoundRobinVolumeChoosingPolicy#chooseVolume()). It is better not to choose such datanodes up front during pipeline allocation in any case.

The pipeline reserved space notion should also consider the Ratis volume size specifically, so that we do not allocate any new pipelines on nodes that are very low on Ratis log disk space, and so that all constituent datanodes of a pipeline have enough space for the existing pipelines to get closed, since closing a pipeline takes a snapshot and does I/O on the Raft log disk as well.

This patch itself looks good to me and can be committed; the pipeline reserve semantics can then be built on top of it.
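For context, a simplified illustration of the hard check referenced above; the real logic lives in RoundRobinVolumeChoosingPolicy#chooseVolume(), and the types and details below are made up for this sketch:

```java
import java.io.IOException;
import java.util.List;

/** Sketch of a round-robin volume choice with a hard space check (hypothetical types). */
final class VolumeChoiceSketch {

  interface Volume {
    long getAvailable();
  }

  private int nextIndex = 0;

  Volume chooseVolume(List<Volume> volumes, long containerSize) throws IOException {
    // Walk the volumes round-robin; a volume is only acceptable if it can
    // still hold a full container, otherwise container creation would fail.
    for (int i = 0; i < volumes.size(); i++) {
      Volume candidate = volumes.get((nextIndex + i) % volumes.size());
      if (candidate.getAvailable() >= containerSize) {
        nextIndex = (nextIndex + i + 1) % volumes.size();
        return candidate;
      }
    }
    throw new IOException("No volume has enough space for a container of size " + containerSize);
  }
}
```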

@guihecheng
Contributor Author

guihecheng commented May 24, 2021

> I guess we need to be more precise here. A container creation won't succeed on a datanode unless it has remaining space >= container size, since there is a hard check there (see RoundRobinVolumeChoosingPolicy#chooseVolume()). It is better not to choose such datanodes up front during pipeline allocation in any case.

Sure, even if we have the proposed config item, we should ensure that if the configured value is < container size, we drop it and use the container size instead (with a warning log), as sketched at the end of this comment.

> The pipeline reserved space notion should also consider the Ratis volume size specifically, so that we do not allocate any new pipelines on nodes that are very low on Ratis log disk space, and so that all constituent datanodes of a pipeline have enough space for the existing pipelines to get closed, since closing a pipeline takes a snapshot and does I/O on the Raft log disk as well.

Yes, the Ratis log disk should be taken into consideration.

> This patch itself looks good to me and can be committed; the pipeline reserve semantics can then be built on top of it.

OK, then I'm not going to update this PR but will open another PR for the proposed solution.
Thanks very much @bshashikant
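A small sketch of the clamping idea from the comment above, assuming hypothetical names (not existing Ozone code): if the configured reservation is smaller than the container size, fall back to the container size and log a warning.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class ReservedSpaceClamp {
  private static final Logger LOG = LoggerFactory.getLogger(ReservedSpaceClamp.class);

  /** Never reserve less than one container size, otherwise the original fix is lost. */
  static long clamp(long configuredReservedBytes, long containerSizeBytes) {
    if (configuredReservedBytes < containerSizeBytes) {
      LOG.warn("Configured pipeline reserved space {} is smaller than the container size {}; "
          + "using the container size instead.", configuredReservedBytes, containerSizeBytes);
      return containerSizeBytes;
    }
    return configuredReservedBytes;
  }
}
```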

@bshashikant bshashikant merged commit 000e7c5 into apache:master May 26, 2021
@guihecheng guihecheng deleted the HDDS-5252 branch May 26, 2021 06:56
errose28 added a commit to errose28/ozone that referenced this pull request Jun 1, 2021
…ing-upgrade-master-merge

* upstream/master: (76 commits)
  HDDS-5280. Make XceiverClientManager creation when necessary in ContainerOperationClient (apache#2289)
  HDDS-5272. Make ozonefs.robot execution repeatable (apache#2280)
  HDDS-5123. Use the pre-created apache/ozone-testkrb5 image during secure acceptance tests (apache#2165)
  HDDS-4993. Add guardrail for reserved buffer size when DN reads a chunk (apache#2058)
  HDDS-4936. Change ozone groupId from org.apache.hadoop to org.apache.ozone (apache#2018)
  HDDS-4043. allow deletion from Trash directory without -skipTrash option (apache#2110)
  HDDS-4927. Determine over and under utilized datanodes in Container Balancer. (apache#2230)
  HDDS-5273. Handle unsecure cluster convert to secure cluster for SCM. (apache#2281)
  HDDS-5158. Add documentation for SCM HA Security. (apache#2205)
  HDDS-5275. Datanode Report Publisher publishes one extra report after DN shutdown (apache#2283)
  HDDS-5241. SCM UI should have leader/follower and Primordial SCM information (apache#2260)
  HDDS-5219. Limit number of bad volumes by dfs.datanode.failed.volumes.tolerated. (apache#2243)
  HDDS-5252. PipelinePlacementPolicy filter out datanodes with not enough space. (apache#2271)
  HDDS-5191. Increase default pvc storage size (apache#2219)
  HDDS-5073. Use ReplicationConfig on client side  (apache#2136)
  HDDS-5250. Build integration tests with Maven cache (apache#2269)
  HDDS-5236. Require block token for more operations (apache#2254)
  HDDS-5266 Misspelt words in S3MultipartUploadCommitPartRequest.java line 202 (apache#2279)
  HDDS-5249. Race Condition between Full and Incremental Container Reports (apache#2268)
  HDDS-5142. Make generic streaming client/service for container re-replication, data read, scm/om snapshot download (apache#2256)
  ...

Conflicts:
	hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocol.java
	hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/protocolPB/StorageContainerLocationProtocolClientSideTranslatorPB.java
	hadoop-hdds/interface-admin/src/main/proto/ScmAdminProtocol.proto
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/MockNodeManager.java
	hadoop-ozone/dist/src/main/compose/testlib.sh
	hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/TestStorageContainerManager.java
	hadoop-ozone/interface-client/src/main/proto/OmClientProtocol.proto
	hadoop-ozone/ozone-manager/pom.xml
	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/utils/OzoneManagerRatisUtils.java
	hadoop-ozone/s3gateway/pom.xml