
Conversation

Contributor

@sadanand48 sadanand48 commented Apr 18, 2023

What changes were proposed in this pull request?

Close containers when a volume reaches its utilisation threshold. If the volume is configured with reserved space, the soft limit is hit when (capacity - reserved) - used <= minFreeSpaceOnVolume.
By default the minimum free space value is 5GB.
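As a rough illustration of that condition (a minimal sketch; capacity, reserved, used and minFreeSpaceOnVolume are stand-ins for the corresponding volume fields, not the exact identifiers in the patch):

```java
// Illustrative sketch of the soft-limit condition only; not the code in this patch.
final class VolumeSoftLimitSketch {

  /** Default minimum free space to keep on a volume (5 GB), as proposed in this PR. */
  static final long DEFAULT_MIN_FREE_SPACE = 5L * 1024 * 1024 * 1024;

  /** True when (capacity - reserved) - used <= minFreeSpaceOnVolume. */
  static boolean softLimitReached(long capacity, long reserved, long used,
      long minFreeSpaceOnVolume) {
    return (capacity - reserved) - used <= minFreeSpaceOnVolume;
  }

  public static void main(String[] args) {
    long gb = 1024L * 1024 * 1024;
    // Example: 100 GB volume, 10 GB reserved, 86 GB used -> only 4 GB usable is left.
    System.out.println(
        softLimitReached(100 * gb, 10 * gb, 86 * gb, DEFAULT_MIN_FREE_SPACE)); // true
  }
}
```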

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8254

How was this patch tested?

Unit tests

  • Send a close-container action when the volume reaches the threshold.
  • Make the Replication Manager aware of this volume utilisation limit. (No change is required here: both PushReplicator and DownloadAndImportReplicator, when replicating a container to a target datanode, respect the VolumeChoosingPolicy, which checks whether the volume has enough space to place the container; if the required space exceeds the available space (which accounts for reserved space), the volume is not chosen. See the sketch below.)
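A simplified sketch of that kind of space check during volume selection (the VolumeLike type and the chooser below are hypothetical stand-ins, not Ozone's actual VolumeChoosingPolicy API):

```java
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified stand-in for a datanode volume; not the real HddsVolume type.
record VolumeLike(String path, long available, long minFreeSpace) {}

final class SimpleVolumeChooser {
  /**
   * Picks the first volume that can still hold a container of the requested size
   * while keeping its minimum free space; nearly full volumes are skipped, which is
   * why replication to such volumes is avoided as well.
   */
  static Optional<VolumeLike> choose(List<VolumeLike> volumes, long containerSize) {
    return volumes.stream()
        .filter(v -> v.available() - containerSize >= v.minFreeSpace())
        .findFirst();
  }
}
```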

@sadanand48 sadanand48 marked this pull request as ready for review April 21, 2023 06:42
Contributor

@sumitagrawl sumitagrawl left a comment


@sadanand48 Thanks for working on this. IMO, the flow in this PR is currently:

  1. If the condition is met, the DN notifies SCM to close the container, and SCM triggers the close after some time.
  2. Writes continue in the meantime; with hdds.datanode.du.reserved alone, new containers are still created and written to.

So this will have an impact:

  • many small containers get created
  • the client may face failures on closed containers

Writes will still happen until hdds.datanode.du.reserved is reached when allocating containers, so this property will not provide much value in that regard.

@sadanand48 sadanand48 marked this pull request as draft April 21, 2023 17:46
@sadanand48 sadanand48 marked this pull request as ready for review April 24, 2023 06:34
@sadanand48
Contributor Author

Thanks @sumitagrawl for the review. I have now made the config take effect during container allocation on a volume as well, so the flow would be:

  1. The volume reaches the threshold usage ratio.
  2. Any further writes to containers on that volume cause the DN to send a container action to SCM (see the sketch at the end of this comment).
  3. SCM receives the action and closes these containers.
  4. New client writes trigger container creation, and the VolumeChoosingPolicy checks this property too, making sure these volumes are not selected. This eliminates the problem of small containers getting created.

the client may face failures on closed containers

In this case, new containers should be allocated when the client retries.
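A rough sketch of step 2 above, i.e. the datanode queueing a close-container action for SCM once the volume crosses the threshold. The types and names below are made up for illustration; the actual patch goes through Ozone's container action / heartbeat machinery:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the datanode's action queue; illustration only.
final class CloseActionQueue {
  private final Set<Long> pendingClose = ConcurrentHashMap.newKeySet();

  /** Queues a close action for the container at most once; SCM picks it up later. */
  void requestClose(long containerId) {
    if (pendingClose.add(containerId)) {
      System.out.println("Queued close-container action for container " + containerId);
    }
  }
}

final class WritePathSketch {
  /** Called on the write path: if the volume is over the soft limit, ask SCM to close. */
  static void onWriteChunk(long containerId, long volumeAvailable,
      long volumeFreeSpaceToSpare, CloseActionQueue queue) {
    if (volumeAvailable <= volumeFreeSpaceToSpare) {
      queue.requestClose(containerId);
    }
    // The write itself still proceeds; SCM closes the container asynchronously.
  }
}
```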

@siddhantsangwan
Contributor

For bigger volumes like 20TB, the default soft limit of 0.9 still leaves 2TB of space that's available for writes. I'm wondering if it's better to define the limit in a different format such as the raw available capacity - something like 5GB?

@ChenSammi
Contributor

Hi @sadanand48, can we reuse the "hdds.datanode.storage.utilization.critical.threshold" property?

@sadanand48
Contributor Author

For bigger volumes like 20TB, the default soft limit of 0.9 still leaves 2TB of space that's available for writes. I'm wondering if it's better to define the limit in a different format such as the raw available capacity - something like 5GB?

Thanks @siddhantsangwan for the comment. This makes sense; I have updated my patch to use a capacity value instead of a percentage. Please take a look.

@sadanand48
Contributor Author

can we reuse the "hdds.datanode.storage.utilization.critical.threshold" property?

Thanks @ChenSammi for the comment. Now that I have changed the patch to use a capacity value, should I still switch to this property, given that hdds.datanode.storage.utilization.critical.threshold takes a float that represents a percentage?
Another thing I noticed is that this property is not used anywhere except to filter output in the SCM JMX. If we don't reuse it here, I feel we can remove it, as it is misleading: configuring it does nothing.

@ChenSammi
Contributor

can we reuse the "hdds.datanode.storage.utilization.critical.threshold" property?

Thanks @ChenSammi for the comment. Now that I have changed the patch to use a capacity value, should I still switch to this property, given that hdds.datanode.storage.utilization.critical.threshold takes a float that represents a percentage? Another thing I noticed is that this property is not used anywhere except to filter output in the SCM JMX. If we don't reuse it here, I feel we can remove it, as it is misleading: configuring it does nothing.

Hi @sadanand48, we already have the following properties in Ozone:

  1. hdds.datanode.dir.du.reserved // storage space
  2. hdds.datanode.dir.du.reserved.percent // storage percentage
  3. hdds.datanode.storage.utilization.warning.threshold // threshold
  4. hdds.datanode.storage.utilization.critical.threshold // threshold

From the user's point of view, "hdds.datanode.volume.min.free.space" looks functionally very similar to "hdds.datanode.dir.du.reserved", like another kind of reserved space. Maybe we can use "hdds.datanode.dir.du.reserved" directly? Currently the default values of "hdds.datanode.dir.du.reserved" and "hdds.datanode.dir.du.reserved.percent" are 0. We could change their defaults (5GB, like in this patch, and 0.95 for the percent property). What do you think?
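For reference, a minimal sketch of how the new property would be set, using the name and 5GB default from this discussion (the percent variant's name below is an assumption, since the thread only says "both options" were added):

```java
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

// Sketch only: shows the property under discussion; not taken from the merged patch.
public final class MinFreeSpaceConfigSketch {
  public static void main(String[] args) {
    OzoneConfiguration conf = new OzoneConfiguration();

    // The soft-limit property proposed in this PR, with its 5GB default.
    conf.set("hdds.datanode.volume.min.free.space", "5GB");
    // Assumed name for the percentage-based variant discussed later in the thread.
    conf.set("hdds.datanode.volume.min.free.space.percent", "0.001");

    System.out.println(conf.get("hdds.datanode.volume.min.free.space"));
  }
}
```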

@sadanand48
Contributor Author

sadanand48 commented Apr 26, 2023

From the user's point of view, "hdds.datanode.volume.min.free.space" looks functionally very similar to "hdds.datanode.dir.du.reserved", like another kind of reserved space.

I gave this some thought and realised that we introduced this property because the flow is as follows:

  1. The datanode runs DU/DF periodically and stores the disk stats in a cache, so at any point in time the stats might be behind the actual usage. (The cache is updated on every write chunk, so block data is accounted for, but RocksDB/Raft log metadata is not.)
  2. If we only configure a default for the reserved space, containers will close when usage crosses the reserved space. There may be a delay before SCM receives the close-container action and asks the DNs in the pipeline to close the container; during this delay the client may still write to that container and violate the reserved space.

If we are okay with crossing the reserved space by a little, we can set a default for the reserved space; otherwise we need a small buffer, like the one defined in this PR, that takes effect before the reserved space is reached.

@ChenSammi
Contributor

From the user's point of view, "hdds.datanode.volume.min.free.space" looks functionally very similar to "hdds.datanode.dir.du.reserved", like another kind of reserved space.

I gave this some thought and realised that we introduced this property because the flow is as follows:

1. The datanode runs DU/DF periodically and stores the disk stats in a cache, so at any point in time the stats might be behind the actual usage. (The cache is updated on every write chunk, so block data is accounted for, but RocksDB/Raft log metadata is not.)

2. If we only configure a default for the reserved space, containers will close when usage crosses the reserved space. There may be a delay before SCM receives the close-container action and asks the DNs in the pipeline to close the container; during this delay the client may still write to that container and violate the reserved space.

If we are okay with crossing the reserved space by a little, we can set a default for the reserved space; otherwise we need a small buffer, like the one defined in this PR, that takes effect before the reserved space is reached.

Had an offline discussion with @sadanand48; here are the agreed points:

  1. This patch helps reduce the possibility of a DN running out of disk space.
  2. Container quota management (lease) will help prevent DNs from running out of disk space from the SCM side. @sumitagrawl.
  3. Besides the container block data, there are the RocksDB directory and the Ratis directory, whose storage usage is not tracked.
  4. Metrics for RocksDB directory usage and Ratis directory usage will help estimate how much space should be reserved for them so that the DN does not run out of space.

@siddhantsangwan
Contributor

@sadanand48 is this PR ready for review now or are you planning to push more commits?

@sadanand48
Contributor Author

@siddhantsangwan, this is ready for review now. I have added both options, a percentage and an absolute value, and the user can choose either.
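One plausible way a helper could resolve the two options into a single minimum-free-space value, sketched under the assumption that an explicitly set absolute value takes precedence (the names and the precedence rule are assumptions, not necessarily what the merged patch does):

```java
// Sketch only: combining an absolute min-free-space setting with a percentage one.
final class MinFreeSpaceResolverSketch {

  static long resolve(long configuredMinFreeSpace,  // e.g. 5 GB, or 0 if unset
                      double configuredPercent,     // e.g. 0.001, or 0 if unset
                      long volumeCapacity) {
    if (configuredMinFreeSpace > 0) {
      return configuredMinFreeSpace;                // absolute value wins (assumption)
    }
    return (long) (volumeCapacity * configuredPercent);
  }

  public static void main(String[] args) {
    long gb = 1024L * 1024 * 1024;
    // 20 TB volume with only the percentage configured -> roughly 20 GB kept free.
    System.out.println(resolve(0, 0.001, 20L * 1024 * gb) / gb + " GB");
  }
}
```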

@siddhantsangwan siddhantsangwan self-requested a review May 3, 2023 11:18
Contributor

@siddhantsangwan siddhantsangwan left a comment


@sadanand48 Thanks for working on this. In general, I think we should use the available space instead of capacity - used. When reserved space hasn't been configured, capacity - used just subtracts the space used by Ozone from the total capacity; it doesn't take into account space used by other applications besides Ozone. VolumeInfo#getAvailable does take this into account, so it is likely to be more accurate. What do you think?
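A small numeric illustration of the difference (made-up numbers; the exact formula behind VolumeInfo#getAvailable is simplified here):

```java
// Illustrates why "capacity - used" can overstate the space that is really free.
public final class AvailableVsCapacityMinusUsed {
  public static void main(String[] args) {
    long gb = 1024L * 1024 * 1024;
    long capacity  = 100 * gb;  // total volume capacity
    long ozoneUsed = 40 * gb;   // space used by Ozone containers
    long otherUsed = 30 * gb;   // space used by other applications on the same disk
    long reserved  = 5 * gb;    // configured reserved space

    // Only subtracts Ozone's own usage: reports 60 GB "free".
    long capacityMinusUsed = capacity - ozoneUsed;
    // Also subtracts other applications' usage and the reserve: 25 GB actually usable.
    long available = capacity - ozoneUsed - otherUsed - reserved;

    System.out.println((capacityMinusUsed / gb) + " GB vs " + (available / gb) + " GB");
  }
}
```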

@sadanand48
Contributor Author

I think we should use available space instead of capacity - used

Thanks, good catch. Updated the patch.

@sadanand48 sadanand48 requested a review from siddhantsangwan May 9, 2023 06:41
Contributor

@sumitagrawl sumitagrawl left a comment


@sadanand48 LGTM +1

Contributor

@siddhantsangwan siddhantsangwan left a comment


LGTM, thanks for the work @sadanand48

@adoroszlai adoroszlai marked this pull request as draft May 15, 2023 13:09
@ChenSammi ChenSammi marked this pull request as ready for review May 16, 2023 06:23
long volumeFreeSpaceToSpare =
    VolumeUsage.getMinVolumeFreeSpace(conf, volumeCapacity);
long volumeAvailable = volume.getAvailable();
return (volumeAvailable <= volumeFreeSpaceToSpare);
Contributor


@sadanand48 , "- vol.getCommittedBytes()" is missing here.

Contributor Author


Done.
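For clarity, a hedged sketch of the check after this fix, i.e. with the committed bytes subtracted as requested above (the volume type here is a minimal stand-in, and this is not a verbatim copy of the merged code):

```java
// Minimal stand-in for the real volume type; illustration only.
interface HddsVolumeLike {
  long getAvailable();
  long getCommittedBytes();
}

final class VolumeFullCheckSketch {
  /** Volume is considered full when available minus committed space drops to the spare limit. */
  static boolean isVolumeFull(HddsVolumeLike volume, long volumeFreeSpaceToSpare) {
    long volumeAvailable = volume.getAvailable() - volume.getCommittedBytes();
    return volumeAvailable <= volumeFreeSpaceToSpare;
  }
}
```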

Contributor


Thanks @sadanand48. The last patch LGTM, +1.

@ChenSammi ChenSammi merged commit 03aec4d into apache:master May 16, 2023
errose28 added a commit to errose28/ozone that referenced this pull request May 17, 2023
k5342 pushed a commit to pfnet/ozone that referenced this pull request Aug 14, 2023
kuenishi pushed a commit to pfnet/ozone that referenced this pull request Aug 14, 2023
