HDDS-12712. Design Document of "Aggressive DB Compaction with Minimal Degradation" #8178
Conversation
Pull Request Overview
This PR adds a design document outlining an "Aggressive DB Compaction with Minimal Degradation" approach. It details new configuration parameters, describes two types of compactors for different table layouts (OBS and FSO), and provides pseudocode examples to illustrate the proposed compaction strategies.
adoroszlai left a comment
Thanks @peterxcli for the doc. Design docs should be added in hadoop-hdds/docs/content/design/ (please check existing docs there for metadata to add in header).
jojochuang left a comment
Thanks! I glanced over the design and at a high level it makes sense to me.
I'm not that familiar with RocksDB and didn't realize RocksDB provides the APIs to gather info at this granularity. That's great!
Hi @peterxcli, thanks for working on this. Refer to this code for numDeletions of the SSTable.
Thanks @guohao-rosicky!
@peterxcli, thanks for the proposal. Looking forward to seeing the benchmark data.
Tejaskriya left a comment
Thanks for working on this proposal @peterxcli, the overall design looks good. Just wondering how the performance of creating and compacting these ranges would be. Benchmarking data would be helpful!
### Create Compactor For Each Table
Create new compactor instances for each table, including `KEY_TABLE`, `DELETED_TABLE`, `DELETED_DIR_TABLE`, `DIRECTORY_TABLE`, and `FILE_TABLE`. Run these background workers using a scheduled executor with configured interval and a random start time to spread out the workload.
We can also include the multipartInfoTable here; for anyone using multipart uploads, this table could also cross the numEntries and numDeletes thresholds.
Thanks for the review, I will start to work on this ASAP!
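As an aside on the quoted "Create Compactor For Each Table" section above, here is a minimal sketch of how the per-table background workers could be scheduled; the class name is hypothetical, and the interval unit and table-name strings are illustrative assumptions rather than part of the proposal:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class TableCompactorService {

  // Table names copied from the quoted design text; the actual compaction
  // logic is out of scope for this sketch.
  private static final List<String> TABLES = Arrays.asList(
      "KEY_TABLE", "DELETED_TABLE", "DELETED_DIR_TABLE",
      "DIRECTORY_TABLE", "FILE_TABLE");

  private final ScheduledExecutorService scheduler =
      Executors.newScheduledThreadPool(TABLES.size());

  /** Starts one background compactor per table at a randomized offset. */
  void start(long intervalMinutes) {
    for (String table : TABLES) {
      // A random initial delay spreads the per-table work across the interval.
      long initialDelay = ThreadLocalRandom.current().nextLong(intervalMinutes);
      scheduler.scheduleAtFixedRate(() -> compactTable(table),
          initialDelay, intervalMinutes, TimeUnit.MINUTES);
    }
  }

  private void compactTable(String table) {
    // Placeholder: inspect SST properties for this table and issue range
    // compactions where tombstone counts exceed the configured thresholds.
  }
}
```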
aryangupta1998 left a comment
Thanks for the patch @peterxcli, the design looks good to me!
One suggestion, can we consider adding a config (e.g., max_parallel_compactions) to control the number of parallel compactions (across multiple tables)? Alternatively, we could handle multiple compaction requests through a CommandQueue mechanism to avoid overloading RocksDB.
Yeah, we can also use a priority queue to prioritise the compaction range that has the highest tombstone percentage.
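A rough sketch of how this prioritisation could sit alongside the suggested parallelism cap is shown below; the `CandidateRange` and `CompactionScheduler` names are hypothetical, and `max_parallel_compactions` is the config name suggested above, not an existing setting:

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CompactionScheduler {

  /** Hypothetical holder for one candidate compaction range. */
  static class CandidateRange {
    final String table;
    final byte[] begin;
    final byte[] end;
    final double tombstoneRatio;

    CandidateRange(String table, byte[] begin, byte[] end,
        double tombstoneRatio) {
      this.table = table;
      this.begin = begin;
      this.end = end;
      this.tombstoneRatio = tombstoneRatio;
    }
  }

  interface CompactionRunner {
    void compact(CandidateRange range);
  }

  // Ranges with the highest tombstone ratio are compacted first.
  private final PriorityQueue<CandidateRange> queue = new PriorityQueue<>(
      Comparator.comparingDouble((CandidateRange r) -> r.tombstoneRatio)
          .reversed());

  // A fixed-size pool bounds concurrent manual compactions, playing the
  // role of the suggested max_parallel_compactions setting.
  private final ExecutorService pool;

  CompactionScheduler(int maxParallelCompactions) {
    this.pool = Executors.newFixedThreadPool(maxParallelCompactions);
  }

  synchronized void enqueue(CandidateRange range) {
    queue.add(range);
  }

  synchronized void drainTo(CompactionRunner runner) {
    CandidateRange next;
    while ((next = queue.poll()) != null) {
      CandidateRange range = next;
      pool.submit(() -> runner.compact(range));
    }
  }
}
```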
I have just completed the benchmark on the OBS table: https://github.com/peterxcli/ozone-helper/blob/main/benchmark/range-compaction/README.md#experimental-results. I plan to update the design document in the next few days based on these results. For reference, my POC branch for the benchmark is here: peterxcli#2. Let me know if you have any feedback or suggestions!
This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.
Thank you for your contribution. This PR is being closed due to inactivity. If needed, feel free to reopen it.


What changes were proposed in this pull request?
Short Introduction
Use the `numEntries` and `numDeletion` in `TableProperties`, which stores statistics for each SST, as "guidance" to determine how to split tables into finer ranges for compaction.
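As an illustration, the sketch below aggregates those per-SST statistics through the RocksDB Java API; the class name and the 0.3 tombstone ratio are placeholder assumptions rather than values proposed by this design:

```java
import java.util.Map;

import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.TableProperties;

public final class SstStatsInspector {

  /**
   * Aggregates entry and tombstone counts across all live SST files of a
   * column family. The 0.3 ratio below is only an illustrative threshold.
   */
  static boolean needsCompaction(RocksDB db, ColumnFamilyHandle cf)
      throws RocksDBException {
    long entries = 0;
    long deletions = 0;
    // One TableProperties entry per live SST file in this column family.
    Map<String, TableProperties> props = db.getPropertiesOfAllTables(cf);
    for (TableProperties tp : props.values()) {
      entries += tp.getNumEntries();
      deletions += tp.getNumDeletions();
    }
    return entries > 0 && (double) deletions / entries > 0.3;
  }

  private SstStatsInspector() {
  }
}
```

Motivation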
Our current approach of compacting entire column families directly would significantly impact online performance through excessive write amplification. After researching TiKV and RocksDB compaction mechanisms, it's clear we need a more sophisticated solution that better balances maintenance operations with user workloads.

TiKV runs background tasks for compaction and logically splits key ranges into table regions (with default size limits of 256MB per region), allowing gradual scanning and compaction of known ranges. While we can use the built-in `TableProperties` in SST files to check metrics like `num_entries` and `num_deletion`, these only represent operation counts without deduplicating keys. TiKV addresses this with a custom `MVCTablePropertiesCollector` for more accurate results, but unfortunately, the Java API doesn't currently support custom collectors, forcing us to rely on built-in statistics.

For the Ozone Manager implementation, we face a different challenge since OM lacks the concept of size-based key range splits. The most logical division we can use is the bucket prefix (file table). For FSO buckets, we can further divide key ranges based on directory `parent_id`, enabling more granular and targeted compaction that minimizes disruption to ongoing operations.

By implementing bucket-level compaction with proper paging mechanisms like `next_bucket` and potentially `next_parent_id` for directory-related tables, we can achieve more efficient storage utilization while maintaining performance. The Java APIs currently provide enough support to implement these ideas, making this approach viable for Ozone Manager.
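To make the bucket-level idea concrete, a per-bucket range compaction for the OBS key table might look roughly like the sketch below; it assumes the usual `/volume/bucket/key` key layout, the class and method names are hypothetical, and paging via `next_bucket` is omitted for brevity:

```java
import java.nio.charset.StandardCharsets;

import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.CompactRangeOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public final class ObsRangeCompactor {

  /**
   * Issues a manual compaction limited to one bucket's key range instead of
   * compacting the whole column family.
   */
  static void compactBucketRange(RocksDB db, ColumnFamilyHandle keyTable,
      String volume, String bucket) throws RocksDBException {
    // OBS keys are laid out as /volume/bucket/key, so a bucket prefix
    // bounds a contiguous key range in the key table.
    byte[] begin = ("/" + volume + "/" + bucket + "/")
        .getBytes(StandardCharsets.UTF_8);
    // '0' is the byte right after '/', so this upper bound closes the range.
    byte[] end = ("/" + volume + "/" + bucket + "0")
        .getBytes(StandardCharsets.UTF_8);
    try (CompactRangeOptions options = new CompactRangeOptions()
        .setExclusiveManualCompaction(false)) {
      db.compactRange(keyTable, begin, end, options);
    }
  }

  private ObsRangeCompactor() {
  }
}
```

What is the link to the Apache JIRA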
https://issues.apache.org/jira/browse/HDDS-12712
How was this patch tested?
No code changes; only a design document is added.