HDDS-12928. datanode min free space configuration #8388
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -29,6 +29,7 @@ | |
| import static org.apache.hadoop.ozone.container.common.statemachine.DatanodeConfiguration.FAILED_DB_VOLUMES_TOLERATED_KEY; | ||
| import static org.apache.hadoop.ozone.container.common.statemachine.DatanodeConfiguration.FAILED_METADATA_VOLUMES_TOLERATED_KEY; | ||
| import static org.apache.hadoop.ozone.container.common.statemachine.DatanodeConfiguration.FAILED_VOLUMES_TOLERATED_DEFAULT; | ||
| import static org.apache.hadoop.ozone.container.common.statemachine.DatanodeConfiguration.HDDS_DATANODE_VOLUME_MIN_FREE_SPACE_PERCENT_DEFAULT; | ||
|
Contributor: Please add a new unit test which doesn't explicitly set any of the two properties.

Author: This is covered by org.apache.hadoop.ozone.container.common.statemachine.TestDatanodeConfiguration#isCreatedWitDefaultValues.

Contributor: isCreatedWitDefaultValues unsets DatanodeConfiguration.HDDS_DATANODE_VOLUME_MIN_FREE_SPACE.

Author: The unset ensures the default value is used in the Ozone configuration, right?

Author: The unset is done for ozone-site.xml as defined in the test module, so that the default value is used if the property is not defined. Comment added. |
||
| import static org.apache.hadoop.ozone.container.common.statemachine.DatanodeConfiguration.PERIODIC_DISK_CHECK_INTERVAL_MINUTES_DEFAULT; | ||
| import static org.apache.hadoop.ozone.container.common.statemachine.DatanodeConfiguration.PERIODIC_DISK_CHECK_INTERVAL_MINUTES_KEY; | ||
| import static org.junit.jupiter.api.Assertions.assertEquals; | ||
|
|
@@ -153,6 +154,7 @@ public void overridesInvalidValues() { | |
| public void isCreatedWitDefaultValues() { | ||
| // GIVEN | ||
| OzoneConfiguration conf = new OzoneConfiguration(); | ||
| // unset the overriding configuration from ozone-site.xml defined for the test module | ||
| conf.unset(DatanodeConfiguration.HDDS_DATANODE_VOLUME_MIN_FREE_SPACE); // set in ozone-site.xml | ||
|
|
||
| // WHEN | ||
|
|
@@ -176,7 +178,13 @@ public void isCreatedWitDefaultValues() { | |
| assertEquals(BLOCK_DELETE_COMMAND_WORKER_INTERVAL_DEFAULT, | ||
| subject.getBlockDeleteCommandWorkerInterval()); | ||
| assertEquals(DatanodeConfiguration.getDefaultFreeSpace(), subject.getMinFreeSpace()); | ||
| assertEquals(DatanodeConfiguration.MIN_FREE_SPACE_UNSET, subject.getMinFreeSpaceRatio()); | ||
| assertEquals(HDDS_DATANODE_VOLUME_MIN_FREE_SPACE_PERCENT_DEFAULT, subject.getMinFreeSpaceRatio()); | ||
| final long oneGB = 1024 * 1024 * 1024; | ||
| // capacity is less, consider default min_free_space | ||
| assertEquals(DatanodeConfiguration.getDefaultFreeSpace(), subject.getMinFreeSpace(oneGB)); | ||
| // capacity is large, consider min_free_space_percent, max(min_free_space, min_free_space_percent * capacity) | ||
| assertEquals(HDDS_DATANODE_VOLUME_MIN_FREE_SPACE_PERCENT_DEFAULT * oneGB * oneGB, | ||
| subject.getMinFreeSpace(oneGB * oneGB)); | ||
| } | ||
|
|
||
| @Test | ||
|
|
@@ -186,11 +194,11 @@ void rejectsInvalidMinFreeSpaceRatio() { | |
|
|
||
| DatanodeConfiguration subject = conf.getObject(DatanodeConfiguration.class); | ||
|
|
||
| assertEquals(DatanodeConfiguration.MIN_FREE_SPACE_UNSET, subject.getMinFreeSpaceRatio()); | ||
| assertEquals(HDDS_DATANODE_VOLUME_MIN_FREE_SPACE_PERCENT_DEFAULT, subject.getMinFreeSpaceRatio()); | ||
| } | ||
|
|
||
| @Test | ||
| void useMinFreeSpaceIfBothMinFreeSpacePropertiesSet() { | ||
| void useMaxIfBothMinFreeSpacePropertiesSet() { | ||
| OzoneConfiguration conf = new OzoneConfiguration(); | ||
| int minFreeSpace = 10000; | ||
| conf.setLong(DatanodeConfiguration.HDDS_DATANODE_VOLUME_MIN_FREE_SPACE, minFreeSpace); | ||
|
|
@@ -199,10 +207,11 @@ void useMinFreeSpaceIfBothMinFreeSpacePropertiesSet() { | |
| DatanodeConfiguration subject = conf.getObject(DatanodeConfiguration.class); | ||
|
|
||
| assertEquals(minFreeSpace, subject.getMinFreeSpace()); | ||
| assertEquals(DatanodeConfiguration.MIN_FREE_SPACE_UNSET, subject.getMinFreeSpaceRatio()); | ||
| assertEquals(.5f, subject.getMinFreeSpaceRatio()); | ||
|
||
|
|
||
| for (long capacity : CAPACITIES) { | ||
| assertEquals(minFreeSpace, subject.getMinFreeSpace(capacity)); | ||
| // disk percent is higher than minFreeSpace configured 10000 bytes | ||
| assertEquals((long)(capacity * 0.5f), subject.getMinFreeSpace(capacity)); | ||
| } | ||
| } | ||
|
|
||
|
|
@@ -211,11 +220,12 @@ void useMinFreeSpaceIfBothMinFreeSpacePropertiesSet() { | |
| void usesFixedMinFreeSpace(long bytes) { | ||
| OzoneConfiguration conf = new OzoneConfiguration(); | ||
| conf.setLong(DatanodeConfiguration.HDDS_DATANODE_VOLUME_MIN_FREE_SPACE, bytes); | ||
| // keeping the percent low so that min free space is picked up | ||
| conf.setFloat(DatanodeConfiguration.HDDS_DATANODE_VOLUME_MIN_FREE_SPACE_PERCENT, 0.00001f); | ||
|
|
||
| DatanodeConfiguration subject = conf.getObject(DatanodeConfiguration.class); | ||
|
|
||
| assertEquals(bytes, subject.getMinFreeSpace()); | ||
| assertEquals(DatanodeConfiguration.MIN_FREE_SPACE_UNSET, subject.getMinFreeSpaceRatio()); | ||
|
|
||
| for (long capacity : CAPACITIES) { | ||
| assertEquals(bytes, subject.getMinFreeSpace(capacity)); | ||
|
|
@@ -226,7 +236,8 @@ void usesFixedMinFreeSpace(long bytes) { | |
| @ValueSource(ints = {1, 10, 100}) | ||
| void calculatesMinFreeSpaceRatio(int percent) { | ||
| OzoneConfiguration conf = new OzoneConfiguration(); | ||
| conf.unset(DatanodeConfiguration.HDDS_DATANODE_VOLUME_MIN_FREE_SPACE); // set in ozone-site.xml | ||
| // keeping min free space low so that the percent is picked up after calculation | ||
| conf.set(DatanodeConfiguration.HDDS_DATANODE_VOLUME_MIN_FREE_SPACE, "1000"); // set in ozone-site.xml | ||
| conf.setFloat(DatanodeConfiguration.HDDS_DATANODE_VOLUME_MIN_FREE_SPACE_PERCENT, percent / 100.0f); | ||
|
|
||
| DatanodeConfiguration subject = conf.getObject(DatanodeConfiguration.class); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,108 @@ | ||
| --- | ||
| title: Minimum free space configuration for datanode volumes | ||
| summary: Describe proposal for minimum free space configuration which volume must have to function correctly. | ||
| date: 2025-05-05 | ||
| jira: HDDS-12928 | ||
| status: implemented | ||
| author: Sumit Agrawal | ||
| --- | ||
| <!-- | ||
| Licensed under the Apache License, Version 2.0 (the "License"); | ||
| you may not use this file except in compliance with the License. | ||
| You may obtain a copy of the License at | ||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. See accompanying LICENSE file. | ||
| --> | ||
|
|
||
| # Abstract | ||
| A volume in the datanode stores container data and metadata (RocksDB co-located on the volume). | ||
| Various operations run in parallel, such as importing containers, exporting containers, writing and deleting data blocks, | ||
|
||
| repairing containers, and creating and deleting containers. Space is also required for the volume DB to perform compaction at regular intervals. | ||
| It is hard to capture exact usage and available free space, so a minimum free space needs to be configured | ||
| so that datanode operations can run without corruption or the environment getting stuck, while still supporting reads of data. | ||
|
|
||
| This free space is used to ensure volume allocation if `required space < (volume available space - free space - reserved space - committed space)`. | ||
| Any container creation or container import needs to ensure this constraint is met, and block byte writes need to ensure that `free space` is available. | ||
| Note: Any issue related to ensuring free space is tracked in a separate JIRA. | ||
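The allocation rule above can be sketched as a small predicate. This is a hypothetical sketch, not Ozone's actual API; the class, method, and parameter names are illustrative only.

```java
// Hypothetical sketch of the allocation constraint described above.
// A container can be placed on a volume only when the space it requires
// fits below what remains after min free, reserved and committed space.
public class VolumeSpaceCheck {

    public static boolean hasEnoughSpace(long requiredBytes,
                                         long availableBytes,
                                         long minFreeBytes,
                                         long reservedBytes,
                                         long committedBytes) {
        return requiredBytes
            < availableBytes - minFreeBytes - reservedBytes - committedBytes;
    }
}
```

For example, placing a 5GB container on a volume with 100GB available, 20GB min free, 10GB reserved, and 30GB committed leaves 40GB of headroom, so the check passes; a 45GB requirement would not.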
|
|
||
| # Existing configuration (before HDDS-12928) | ||
| Two configurations are provided, | ||
| - hdds.datanode.volume.min.free.space (default: 5GB) | ||
| - hdds.datanode.volume.min.free.space.percent | ||
|
|
||
| 1. If neither is configured, the default value of 5GB is used. | ||
| 2. If both are configured, hdds.datanode.volume.min.free.space takes priority. | ||
| 3. Otherwise, whichever one is configured is used. | ||
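The pre-HDDS-12928 resolution order listed above can be sketched as follows. This is an illustrative sketch under assumed names, not the actual DatanodeConfiguration code; `null` stands in for an unset key.

```java
// Sketch of the legacy (pre-HDDS-12928) min free space resolution order.
// Names are illustrative, not Ozone's actual fields.
public class LegacyMinFreeSpace {

    static final long DEFAULT_BYTES = 5L * 1024 * 1024 * 1024; // 5GB default

    // fixedBytes / percent are null when the corresponding key is unset.
    public static long resolve(Long fixedBytes, Float percent, long capacityBytes) {
        if (fixedBytes != null) {
            return fixedBytes;                        // rule 2: fixed value wins
        }
        if (percent != null) {
            return (long) (capacityBytes * percent);  // rule 3: percent used alone
        }
        return DEFAULT_BYTES;                         // rule 1: nothing configured
    }
}
```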
|
|
||
| # Problem Statement | ||
|
|
||
| - With the 5GB default configuration, full-disk scenarios are not reliably avoided, because errors occur in ensuring free space availability. | ||
| This is because an imported container can itself be 5GB, right at the boundary, while other operations run in parallel. | ||
| - Volume DB size can increase with disk size, as a larger disk can hold more containers and blocks, and hence more metadata. | ||
| - Volume DB size can also vary with the mix of small and large files, as more small files lead to more metadata. | ||
|
|
||
| The solution involves: | ||
| - an appropriate default minimum free space | ||
| - scaling with disk size | ||
|
|
||
| # Approach 1: Combination of minimum free space and a percent that scales with disk size | ||
|
|
||
| Configuration: | ||
| 1. Minimum free space: hdds.datanode.volume.min.free.space: default value `20GB` | ||
| 2. disk size variation: hdds.datanode.volume.min.free.space.percent: default 0.1% or 0.001 ratio | ||
|
|
||
| Minimum free space = Max (`<Min free space>`, `<percent disk space>`) | ||
|
|
||
| | Disk space | Min Free Space (percent: 1%) | Min Free Space ( percent: 0.1%) | | ||
| | -- |------------------------------|---------------------------------| | ||
| | 100 GB | 20 GB | 20 GB (min space default) | | ||
| | 1 TB | 20 GB | 20 GB (min space default) | | ||
| | 10 TB | 100 GB | 20 GB (min space default) | | ||
| | 100 TB | 1 TB | 100 GB | | ||
|
|
||
| Considering the above table with this solution: | ||
| - 0.1% should be sufficient for almost all cases, as no DN volume DB has been observed to exceed 1-2 GB. | ||
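Approach 1's rule can be sketched in a few lines. The constants mirror the proposed defaults (20GB floor, 0.1% ratio); the class and method names are illustrative, not the actual implementation.

```java
// Sketch of Approach 1: min free space = max(fixed floor, percent * capacity).
public class Approach1MinFreeSpace {

    static final long MIN_FREE_SPACE_DEFAULT = 20L * 1024 * 1024 * 1024; // 20GB
    static final double PERCENT_DEFAULT = 0.001;                         // 0.1%

    public static long minFreeSpace(long capacityBytes) {
        return Math.max(MIN_FREE_SPACE_DEFAULT,
            (long) (capacityBytes * PERCENT_DEFAULT));
    }
}
```

At 10TB the percent term (about 10GB) is below the 20GB floor, so the fixed default wins; only around 100TB does the percent term take over, matching the 0.1% column of the table.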
|
|
||
| # Approach 2: Only minimum free space configuration | ||
|
|
||
| Considering the above approach, 20 GB as the default should be sufficient for most disks, as disk sizes of 10-15TB are typical. | ||
| Larger disks are rarely used; instead, multiple volumes on multiple disks are attached to the same DN. | ||
|
|
||
| In this scenario, the minimum free space setting `hdds.datanode.volume.min.free.space` by itself is enough, and the | ||
| percent-based configuration can be removed. | ||
|
|
||
| ### Compatibility | ||
| If `hdds.datanode.volume.min.free.space.percent` is configured, this should not have any impact, | ||
| as the default value is increased to 20GB, which covers most use cases. | ||
|
|
||
| # Approach 3: Combination of maximum free space and a percent of disk size | ||
|
|
||
| Configuration: | ||
| 1. Maximum free space: hdds.datanode.volume.min.free.space: default value `20GB` | ||
| 2. disk size variation: hdds.datanode.volume.min.free.space.percent: default 10% or 0.1 ratio | ||
|
|
||
| Minimum free space = **Min** (`<Max free space>`, `<percent disk space>`) | ||
| > The difference from Approach 1 is that the Min function is applied over the two configurations above. | ||
|
|
||
| | Disk space | Min Free Space (20GB, 10% of disk) | | ||
| | -- |------------------------------------| | ||
| 10 GB | 1 GB (=Min(20GB, 1GB)) | | ||
| 100 GB | 10 GB (=Min(20GB, 10GB)) | | ||
| 1 TB | 20 GB (=Min(20GB, 100GB)) | | ||
| 10 TB | 20 GB (=Min(20GB, 1TB)) | | ||
| 100 TB | 20 GB (=Min(20GB, 10TB)) | | ||
|
|
||
| This approach is more useful for test environments where disk space is small, without needing any additional configuration. | ||
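For contrast, Approach 3 only changes the combining function from Max to Min. A minimal sketch, with illustrative names and the table's defaults (20GB cap, 10% ratio):

```java
// Sketch of Approach 3: min free space = min(cap, percent * capacity).
public class Approach3MinFreeSpace {

    static final long MAX_FREE_SPACE_DEFAULT = 20L * 1024 * 1024 * 1024; // 20GB cap
    static final double PERCENT_DEFAULT = 0.10;                          // 10%

    public static long minFreeSpace(long capacityBytes) {
        return Math.min(MAX_FREE_SPACE_DEFAULT,
            (long) (capacityBytes * PERCENT_DEFAULT));
    }
}
```

On a small 10GB test disk this yields only 1GB of required free space, while on any disk of 200GB or more it saturates at the 20GB cap.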
|
|
||
| # Conclusion | ||
| 1. Going with Approach 1 | ||
Contributor: I think supporting both explicit size and percent is good, but there are a few issues still not addressed.
Probably the most user-friendly thing to do is deprecate the percent config keys and have one config that takes either a size or percent based value. Whether we want to continue supporting individual volume mappings in the config is still an open question that needs to be resolved in this proposal.

Author: @errose28 Using 2 configs has been discussed in the community meeting, and it was concluded to have both. Any concern now needs re-discussion with the community again. Single config: Approach "2" is

Contributor: Community meetings are for synchronous discussion, not definitive decisions. There are many other forums (mailing list, PRs, Jira, Github discussion). I think this kind of issue is fine for discussion in the PR. If you are concerned about visibility, please discuss on the mailing list.

Author: @errose28 after discussion with the community, will go with

Contributor: Let me try to add some guiding principles for config modifications which can help us compare one decision against another. The following are usability issues that can occur with config keys:
- Inconsistent config format
- Hidden config dependencies
Next let's look at how the percent variations affect point 2. Anything other than failing startup if the percent and non-percent variations are specified creates this problem, so if a percent and non-percent config key are given like There is another option though: get rid of the percentage-specific config keys but still support percentage-based configuration with the one
Proposal to address all requirements: The following layout meets all the constraints defined above:
We should never introduce usability issues in our configurations. We have enough of them already : ) If you can show how an alternate proposal meets all the configuration requirements without impacting usability we can consider that as well, but currently none of the proposals in the doc satisfy this.

Author: @errose28 You mean we need to have another config for min.free.space? Just because the configs have similar names for space does not mean we should go with this approach; they serve different purposes, so I do not agree with making them similar. In the future, if there is a need for a volume mapping for min.free.space, we can add it as a separate requirement and handle it then. Share your suggestion for this PR if it can be merged .....

Contributor: Adding one more to @errose28's list of requirements: cross-compatibility. When extending the possible values allowed for an existing configuration, e.g.:
we need to consider that even an old version may encounter values understood only by the new one, and fail. (See HDDS-13077 for a specific example.) In such cases it may be better to deprecate the existing config properties and add new one(s).

Contributor: @sumitagrawl please re-read the "Proposal to address all requirements" section in my reply. I think it very clearly states the proposal, but the things you are referring to in your reply are not mentioned there.
No, two configs, one for min free space and one for DU reserved, each using the same value schema. I very clearly said in the previous response "Only two config keys: hdds.datanode.min.free.space and hdds.datanode.du.reserved".
This is your take as a developer. You need to look at this from a user's perspective. Our consistent failure to consider this perspective is why the system is difficult to use. Configs representing the same "type" of configuration, be it an address, percentage, disk space, or time duration, must accept the same types of values. Users are not going to understand the nuance of why two similar configs accept different value formats, and in a few months I probably won't either.
This is not part of the proposal. Please re-read it. Min space can be configured with one value across all disks, OR it can use a volume mapping.
Lack of a use case is not a valid reason to create a separate value schema for configs that work on the same type. There is also no use case for setting
We definitely need to formalize our configuration compatibility guarantees. This probably warrants a dedicated discussion somewhere more visible. My initial take is that we should always support "new software, old config", but that supporting "old software, new config" is not sustainable because it closes our config for extensions. Especially on the server side this would seem like a deployment error. Maybe our client-side config compat guarantees would be different from the server's.

Author: DU reserved is a special case carried over from Hadoop, for the case of a disk shared with another application. It may not be required to have the same value schema. It needs user input per disk, as sharing may differ, so that schema is specialized. They are not of the same type.
From the user's perspective, the user has no knowledge of how to configure min-free-space, this is more
This might be an additional config that can be added later on a need basis. Maybe we should not add it just based on

Contributor: Per-disk configuration is an abomination that stems from needing to run other applications on nodes/drives along with HDFS in the past. It makes sense for the
I am all for consistency, but in this case it implies a capability that I am not sure we wish to implement. |
||
| - Approach 2 is simpler, setting only min-free-space, but it does not scale with larger disk sizes. | ||
| - Approach 3 is more applicable to test environments where disk space is small; otherwise it behaves like Approach 2. | ||
| - So Approach 1 is selected, considering the advantage that higher free space is configured by default on larger disks. | ||
| 2. Minimum free space will be 20GB by default | ||
|
|
||
|
|
||