
Conversation

@peterxcli (Member) commented Mar 27, 2025

What changes were proposed in this pull request?

Short Introduction

Use numEntries and numDeletions from TableProperties, which stores per-SST statistics, as guidance for splitting tables into finer key ranges for compaction.
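For illustration, a minimal sketch of reading these per-SST statistics for a key range through the stock RocksJava API (the helper class is hypothetical, and a later comment in this thread notes a JNI return-type issue with getPropertiesOfTablesInRange in current builds):

```java
import java.util.Collections;
import java.util.Map;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.Range;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.Slice;
import org.rocksdb.TableProperties;

final class RangeStats {
  /** True if the SST files overlapping [start, end) look tombstone-heavy. */
  static boolean isCompactionWorthwhile(RocksDB db, ColumnFamilyHandle cf,
      byte[] start, byte[] end, double tombstoneRatioThreshold)
      throws RocksDBException {
    Range range = new Range(new Slice(start), new Slice(end));
    // Built-in TableProperties of every SST file overlapping the range.
    Map<String, TableProperties> props =
        db.getPropertiesOfTablesInRange(cf, Collections.singletonList(range));
    long entries = 0;
    long deletions = 0;
    for (TableProperties tp : props.values()) {
      entries += tp.getNumEntries();      // operation count, keys not deduplicated
      deletions += tp.getNumDeletions();  // delete tombstones
    }
    return entries > 0
        && (double) deletions / entries >= tombstoneRatioThreshold;
  }
}
```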

Motivation

Our current approach of compacting entire column families directly would significantly impact online performance through excessive write amplification. After researching TiKV and RocksDB compaction mechanisms, it's clear we need a more sophisticated solution that better balances maintenance operations with user workloads.

TiKV runs background tasks for compaction and logically splits key ranges into table regions (with default size limits of 256MB per region), allowing gradual scanning and compaction of known ranges. While we can use the built-in TableProperties in SST files to check metrics like num_entries and num_deletion, these only represent operation counts without deduplicating keys. TiKV addresses this with a custom MVCTablePropertiesCollector for more accurate results, but unfortunately, the Java API doesn't currently support custom collectors, forcing us to rely on built-in statistics.

For the Ozone Manager implementation, we face a different challenge since OM lacks the concept of size-based key range splits. The most logical division we can use is the bucket prefix (file table). For FSO buckets, we can further divide key ranges based on directory parent_id, enabling more granular and targeted compaction that minimizes disruption to ongoing operations.

By implementing bucket-level compaction with proper paging mechanisms like next_bucket and potentially next_parent_id for directory-related tables, we can achieve more efficient storage utilization while maintaining performance. The Java APIs currently provide enough support to implement these ideas, making this approach viable for Ozone Manager.
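For illustration only, a rough sketch of the bucket-level paging idea (all class and method names are hypothetical; the real OM key layout and iteration APIs differ):

```java
import java.util.Arrays;
import java.util.List;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class BucketRangeCompactor {
  private int nextBucketIndex;  // paging cursor into the bucket prefix list

  /** Compact at most bucketsPerRun bucket key ranges, then remember where to resume. */
  void runOnce(RocksDB db, ColumnFamilyHandle cf, List<byte[]> bucketPrefixes,
      int bucketsPerRun) throws RocksDBException {
    int end = Math.min(nextBucketIndex + bucketsPerRun, bucketPrefixes.size());
    for (int i = nextBucketIndex; i < end; i++) {
      byte[] prefix = bucketPrefixes.get(i);
      // Compact only the key range covered by this bucket prefix.
      db.compactRange(cf, prefix, prefixUpperBound(prefix));
    }
    // Advance the cursor; wrap around once every bucket has been visited.
    nextBucketIndex = end >= bucketPrefixes.size() ? 0 : end;
  }

  /** Smallest key strictly greater than every key sharing the given prefix. */
  private static byte[] prefixUpperBound(byte[] prefix) {
    byte[] upper = Arrays.copyOf(prefix, prefix.length);
    for (int i = upper.length - 1; i >= 0; i--) {
      if ((upper[i] & 0xFF) != 0xFF) {
        upper[i]++;
        return Arrays.copyOf(upper, i + 1);
      }
    }
    return null;  // all 0xFF: no upper bound, compact to the end of the CF
  }
}
```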

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12712

How was this patch tested?

No; only a design document is added.

@peterxcli changed the title from "HDDS-12682. Aggressive DB Compaction with Minimal Degradation" to "HDDS-12712. Design Document of 'Aggressive DB Compaction with Minimal Degradation'" Mar 27, 2025
@jojochuang requested a review from Copilot March 27, 2025 15:25
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds a design document outlining an "Aggressive DB Compaction with Minimal Degradation" approach. It details new configuration parameters, describes two types of compactors for different table layouts (OBS and FSO), and provides pseudocode examples to illustrate the proposed compaction strategies.

@adoroszlai (Contributor) left a comment

Thanks @peterxcli for the doc. Design docs should be added in hadoop-hdds/docs/content/design/ (please check existing docs there for metadata to add in header).

@adoroszlai added the dependencies and design labels Mar 27, 2025
@jojochuang (Contributor) left a comment

Thanks! I glanced over the design, and at a high level it makes sense to me.
I'm not that familiar with RocksDB and didn't realize it provides APIs to gather information at this granularity. That's great!

@guohao-rosicky (Contributor) commented:

Hi Peter (@peterxcli), thanks for working on this. You can refer to this code for the numDeletions of the SST files:

    // Find the live SST file with the most delete tombstones.
    RocksDatabase rocksDB = ((RDBStore) getStore()).getDb();
    List<LiveFileMetaData> liveFileMetaDataList =
        rocksDB.getLiveFilesMetaData();
    LiveFileMetaData mostDeletionSST = null;
    for (LiveFileMetaData metadata : liveFileMetaDataList) {
      if (mostDeletionSST == null
          || mostDeletionSST.numDeletions() < metadata.numDeletions()) {
        mostDeletionSST = metadata;
      }
    }

    if (mostDeletionSST != null) {
      // Resolve the column family the selected SST file belongs to.
      // Note: LiveFileMetaData.columnFamilyName() returns byte[] in the
      // RocksDB Java API, so a byte[]-to-String conversion may be needed
      // for this comparison.
      ColumnFamily columnFamily = null;
      for (ColumnFamily cf : rocksDB.getExtraColumnFamilies()) {
        if (cf.getName().equals(mostDeletionSST.columnFamilyName())) {
          columnFamily = cf;
          break;
        }
      }
      if (columnFamily != null) {
        // Force compaction of the bottommost level so tombstones are dropped.
        ManagedCompactRangeOptions options =
            new ManagedCompactRangeOptions();
        options.setBottommostLevelCompaction(
            ManagedCompactRangeOptions.BottommostLevelCompaction.kForce);
        // Compact only the key range covered by the selected SST file.
        rocksDB.compactRange(columnFamily,
            mostDeletionSST.smallestKey(),
            mostDeletionSST.largestKey(),
            options);
      }
    }

@peterxcli (Member, Author) commented:

Thanks @guohao-rosicky

@ChenSammi (Contributor) commented:

@peterxcli, thanks for the proposal. Looking forward to seeing the benchmark data.

@Tejaskriya (Contributor) left a comment

Thanks for working on this proposal @peterxcli, the overall design looks good. Just wondering what the performance of creating and compacting these ranges would be. Benchmarking data would be helpful!


### Create Compactor For Each Table

Create new compactor instances for each table, including `KEY_TABLE`, `DELETED_TABLE`, `DELETED_DIR_TABLE`, `DIRECTORY_TABLE`, and `FILE_TABLE`. Run these background workers on a scheduled executor with a configured interval and a random start time to spread out the workload.
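
A minimal sketch of how these per-table workers could be scheduled (class and variable names here are illustrative, not from the actual patch):

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

final class CompactionScheduler {
  private final ScheduledExecutorService executor =
      Executors.newScheduledThreadPool(1);

  void start(List<Runnable> tableCompactors, long intervalMinutes) {
    for (Runnable compactor : tableCompactors) {
      // Random start time spreads the per-table work over one interval.
      long initialDelay = ThreadLocalRandom.current().nextLong(intervalMinutes);
      executor.scheduleWithFixedDelay(
          compactor, initialDelay, intervalMinutes, TimeUnit.MINUTES);
    }
  }

  void stop() {
    executor.shutdownNow();
  }
}
```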

We can also include the multipartInfoTable here; for anyone using multipart uploads, this table could also cross the numEntries and numDeletions thresholds.

@peterxcli (Member, Author) commented:

> Just wondering what the performance of creating and compacting these ranges would be. Benchmarking data would be helpful!

Thanks for the review, I will start to work on this ASAP!

@aryangupta1998 (Contributor) left a comment

Thanks for the patch @peterxcli, the design looks good to me!
One suggestion, can we consider adding a config (e.g., max_parallel_compactions) to control the number of parallel compactions (across multiple tables)? Alternatively, we could handle multiple compaction requests through a CommandQueue mechanism to avoid overloading RocksDB.

@peterxcli (Member, Author) commented Apr 11, 2025

> Thanks for the patch @peterxcli, the design looks good to me! One suggestion, can we consider adding a config (e.g., max_parallel_compactions) to control the number of parallel compactions (across multiple tables)? Alternatively, we could handle multiple compaction requests through a CommandQueue mechanism to avoid overloading RocksDB.

Yeah, we can also use a priority queue to prioritize the compaction ranges with the highest tombstone percentage.
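
A minimal sketch of that idea, assuming a hypothetical CompactionCandidate holding the range and its aggregated SST statistics:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class CompactionQueue {
  static final class CompactionCandidate {
    final String table;
    final byte[] startKey;
    final byte[] endKey;
    final long numEntries;     // from TableProperties of the overlapping SSTs
    final long numDeletions;

    CompactionCandidate(String table, byte[] startKey, byte[] endKey,
        long numEntries, long numDeletions) {
      this.table = table;
      this.startKey = startKey;
      this.endKey = endKey;
      this.numEntries = numEntries;
      this.numDeletions = numDeletions;
    }

    double tombstoneRatio() {
      return numEntries == 0 ? 0.0 : (double) numDeletions / numEntries;
    }
  }

  // Highest tombstone ratio first.
  private final PriorityQueue<CompactionCandidate> queue = new PriorityQueue<>(
      Comparator.comparingDouble(CompactionCandidate::tombstoneRatio).reversed());

  void offer(CompactionCandidate candidate) {
    queue.offer(candidate);
  }

  CompactionCandidate poll() {
    return queue.poll();
  }
}
```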

@peterxcli (Member, Author) commented Jun 16, 2025

I have just completed the benchmark on the OBS table: https://github.com/peterxcli/ozone-helper/blob/main/benchmark/range-compaction/README.md#experimental-results

(Benchmark result charts attached.)

I plan to update the design document in the next few days based on these results.

For reference, my POC branch for the benchmark is here: peterxcli#2
Note that it uses my RocksDB fork build, which addresses the JNI return-type issue in the getPropertiesOfTablesInRange function. If there's consensus on this proposal, I plan to adapt the existing rocks-native module with a patch to achieve the same fix (thanks to @swamirishi for the input).

Let me know if you have any feedback or suggestions!

@github-actions

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions bot added the stale label Nov 12, 2025
@github-actions

Thank you for your contribution. This PR is being closed due to inactivity. If needed, feel free to reopen it.

@github-actions bot closed this Nov 19, 2025