
Conversation

@peterxcli (Member) commented Mar 27, 2025

What changes were proposed in this pull request?

Short Introduction

Use numEntries and numDeletions from TableProperties, which stores per-SST statistics, as guidance for splitting tables into finer key ranges for compaction.
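For illustration, a minimal sketch of reading these per-SST statistics for a key range through the stock RocksJava API (the helper class is hypothetical, and a later comment in this thread notes a JNI return-type issue with getPropertiesOfTablesInRange in current builds):

```java
import java.util.Collections;
import java.util.Map;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.Range;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.Slice;
import org.rocksdb.TableProperties;

final class RangeStats {
  /** True if the SST files overlapping [start, end) look tombstone-heavy. */
  static boolean isCompactionWorthwhile(RocksDB db, ColumnFamilyHandle cf,
      byte[] start, byte[] end, double tombstoneRatioThreshold)
      throws RocksDBException {
    Range range = new Range(new Slice(start), new Slice(end));
    // Built-in TableProperties of every SST file overlapping the range.
    Map<String, TableProperties> props =
        db.getPropertiesOfTablesInRange(cf, Collections.singletonList(range));
    long entries = 0;
    long deletions = 0;
    for (TableProperties tp : props.values()) {
      entries += tp.getNumEntries();      // operation count, keys not deduplicated
      deletions += tp.getNumDeletions();  // delete tombstones
    }
    return entries > 0
        && (double) deletions / entries >= tombstoneRatioThreshold;
  }
}
```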

Motivation

Our current approach of compacting entire column families directly would significantly impact online performance through excessive write amplification. After researching TiKV and RocksDB compaction mechanisms, it's clear we need a more sophisticated solution that better balances maintenance operations with user workloads.

TiKV runs background tasks for compaction and logically splits key ranges into table regions (with default size limits of 256MB per region), allowing gradual scanning and compaction of known ranges. While we can use the built-in TableProperties in SST files to check metrics like num_entries and num_deletion, these only represent operation counts without deduplicating keys. TiKV addresses this with a custom MVCTablePropertiesCollector for more accurate results, but unfortunately, the Java API doesn't currently support custom collectors, forcing us to rely on built-in statistics.

For the Ozone Manager implementation, we face a different challenge since OM lacks the concept of size-based key range splits. The most logical division we can use is the bucket prefix (file table). For FSO buckets, we can further divide key ranges based on directory parent_id, enabling more granular and targeted compaction that minimizes disruption to ongoing operations.

By implementing bucket-level compaction with proper paging mechanisms like next_bucket and potentially next_parent_id for directory-related tables, we can achieve more efficient storage utilization while maintaining performance. The Java APIs currently provide enough support to implement these ideas, making this approach viable for Ozone Manager.
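For illustration only, a rough sketch of the bucket-level paging idea (all class and method names are hypothetical; the real OM key layout and iteration APIs differ):

```java
import java.util.Arrays;
import java.util.List;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class BucketRangeCompactor {
  private int nextBucketIndex;  // paging cursor into the bucket prefix list

  /** Compact at most bucketsPerRun bucket key ranges, then remember where to resume. */
  void runOnce(RocksDB db, ColumnFamilyHandle cf, List<byte[]> bucketPrefixes,
      int bucketsPerRun) throws RocksDBException {
    int end = Math.min(nextBucketIndex + bucketsPerRun, bucketPrefixes.size());
    for (int i = nextBucketIndex; i < end; i++) {
      byte[] prefix = bucketPrefixes.get(i);
      // Compact only the key range covered by this bucket prefix.
      db.compactRange(cf, prefix, prefixUpperBound(prefix));
    }
    // Advance the cursor; wrap around once every bucket has been visited.
    nextBucketIndex = end >= bucketPrefixes.size() ? 0 : end;
  }

  /** Smallest key strictly greater than every key sharing the given prefix. */
  private static byte[] prefixUpperBound(byte[] prefix) {
    byte[] upper = Arrays.copyOf(prefix, prefix.length);
    for (int i = upper.length - 1; i >= 0; i--) {
      if ((upper[i] & 0xFF) != 0xFF) {
        upper[i]++;
        return Arrays.copyOf(upper, i + 1);
      }
    }
    return null;  // all 0xFF: no upper bound, compact to the end of the CF
  }
}
```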

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12712

How was this patch tested?

No; only a design document is added.

@peterxcli changed the title from "HDDS-12682. Aggressive DB Compaction with Minimal Degradation" to "HDDS-12712. Design Document of 'Aggressive DB Compaction with Minimal Degradation'" Mar 27, 2025
@jojochuang requested a review from Copilot March 27, 2025 15:25
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds a design document outlining an "Aggressive DB Compaction with Minimal Degradation" approach. It details new configuration parameters, describes two types of compactors for different table layouts (OBS and FSO), and provides pseudocode examples to illustrate the proposed compaction strategies.

@adoroszlai (Contributor) left a comment

Thanks @peterxcli for the doc. Design docs should be added in hadoop-hdds/docs/content/design/ (please check existing docs there for metadata to add in header).

@adoroszlai added the dependencies and design labels Mar 27, 2025
@jojochuang (Contributor) left a comment

Thanks! I glanced over the design, and at a high level it makes sense to me.
I'm not that familiar with RocksDB and didn't realize it provides APIs to gather information at this granularity. That's great!

@guohao-rosicky (Contributor) commented:

Hi Peter (@peterxcli), thanks for working on this. You can refer to this code for the numDeletions of the SST files:

    // Find the live SST file with the most delete tombstones.
    RocksDatabase rocksDB = ((RDBStore) getStore()).getDb();
    List<LiveFileMetaData> liveFileMetaDataList =
        rocksDB.getLiveFilesMetaData();
    LiveFileMetaData mostDeletionSST = null;
    for (LiveFileMetaData metadata : liveFileMetaDataList) {
      if (mostDeletionSST == null
          || mostDeletionSST.numDeletions() < metadata.numDeletions()) {
        mostDeletionSST = metadata;
      }
    }

    if (mostDeletionSST != null) {
      // Resolve the column family the selected SST file belongs to.
      // Note: LiveFileMetaData.columnFamilyName() returns byte[] in the
      // RocksDB Java API, so a byte[]-to-String conversion may be needed
      // for this comparison.
      ColumnFamily columnFamily = null;
      for (ColumnFamily cf : rocksDB.getExtraColumnFamilies()) {
        if (cf.getName().equals(mostDeletionSST.columnFamilyName())) {
          columnFamily = cf;
          break;
        }
      }
      if (columnFamily != null) {
        // Force compaction of the bottommost level so tombstones are dropped.
        ManagedCompactRangeOptions options =
            new ManagedCompactRangeOptions();
        options.setBottommostLevelCompaction(
            ManagedCompactRangeOptions.BottommostLevelCompaction.kForce);
        // Compact only the key range covered by the selected SST file.
        rocksDB.compactRange(columnFamily,
            mostDeletionSST.smallestKey(),
            mostDeletionSST.largestKey(),
            options);
      }
    }

@peterxcli (Member, Author) commented:

Thanks @guohao-rosicky

@ChenSammi (Contributor) commented:

@peterxcli, thanks for the proposal. Looking forward to seeing the benchmark data.

@Tejaskriya (Contributor) left a comment

Thanks for working on this proposal @peterxcli, the overall design looks good. Just wondering what the performance of creating and compacting these ranges would be. Benchmarking data would be helpful!


### Create Compactor For Each Table

Create new compactor instances for each table, including `KEY_TABLE`, `DELETED_TABLE`, `DELETED_DIR_TABLE`, `DIRECTORY_TABLE`, and `FILE_TABLE`. Run these background workers on a scheduled executor with a configured interval and a random start time to spread out the workload.
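
A minimal sketch of how these per-table workers could be scheduled (class and variable names here are illustrative, not from the actual patch):

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

final class CompactionScheduler {
  private final ScheduledExecutorService executor =
      Executors.newScheduledThreadPool(1);

  void start(List<Runnable> tableCompactors, long intervalMinutes) {
    for (Runnable compactor : tableCompactors) {
      // Random start time spreads the per-table work over one interval.
      long initialDelay = ThreadLocalRandom.current().nextLong(intervalMinutes);
      executor.scheduleWithFixedDelay(
          compactor, initialDelay, intervalMinutes, TimeUnit.MINUTES);
    }
  }

  void stop() {
    executor.shutdownNow();
  }
}
```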

We can also include the multipartInfoTable here; for anyone using multipart uploads, this table could also cross the numEntries and numDeletions thresholds.

@peterxcli (Member, Author) commented:

> Just wondering what the performance of creating and compacting these ranges would be. Benchmarking data would be helpful!

Thanks for the review, I will start to work on this ASAP!

@aryangupta1998 (Contributor) left a comment

Thanks for the patch @peterxcli, the design looks good to me!
One suggestion, can we consider adding a config (e.g., max_parallel_compactions) to control the number of parallel compactions (across multiple tables)? Alternatively, we could handle multiple compaction requests through a CommandQueue mechanism to avoid overloading RocksDB.

@peterxcli (Member, Author) commented Apr 11, 2025

> Thanks for the patch @peterxcli, the design looks good to me! One suggestion, can we consider adding a config (e.g., max_parallel_compactions) to control the number of parallel compactions (across multiple tables)? Alternatively, we could handle multiple compaction requests through a CommandQueue mechanism to avoid overloading RocksDB.

Yeah, we can also use a priority queue to prioritize the compaction ranges with the highest tombstone percentage.
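
A minimal sketch of that idea, assuming a hypothetical CompactionCandidate holding the range and its aggregated SST statistics:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class CompactionQueue {
  static final class CompactionCandidate {
    final String table;
    final byte[] startKey;
    final byte[] endKey;
    final long numEntries;     // from TableProperties of the overlapping SSTs
    final long numDeletions;

    CompactionCandidate(String table, byte[] startKey, byte[] endKey,
        long numEntries, long numDeletions) {
      this.table = table;
      this.startKey = startKey;
      this.endKey = endKey;
      this.numEntries = numEntries;
      this.numDeletions = numDeletions;
    }

    double tombstoneRatio() {
      return numEntries == 0 ? 0.0 : (double) numDeletions / numEntries;
    }
  }

  // Highest tombstone ratio first.
  private final PriorityQueue<CompactionCandidate> queue = new PriorityQueue<>(
      Comparator.comparingDouble(CompactionCandidate::tombstoneRatio).reversed());

  void offer(CompactionCandidate candidate) {
    queue.offer(candidate);
  }

  CompactionCandidate poll() {
    return queue.poll();
  }
}
```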

@peterxcli (Member, Author) commented Jun 16, 2025

I have just completed the benchmark on the OBS table: https://github.com/peterxcli/ozone-helper/blob/main/benchmark/range-compaction/README.md#experimental-results

(Benchmark result charts attached.)

I plan to update the design document in the next few days based on these results.

For reference, my POC branch for the benchmark is here: peterxcli#2
Note that it uses my RocksDB fork build, which addresses the JNI return-type issue in the getPropertiesOfTablesInRange function. If there's consensus on this proposal, I plan to adapt the existing rocks-native module with a patch to achieve the same fix (thanks to @swamirishi for the input).

Let me know if you have any feedback or suggestions!

@github-actions

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions bot added the stale label Nov 12, 2025
@github-actions

Thank you for your contribution. This PR is being closed due to inactivity. If needed, feel free to reopen it.

@github-actions bot closed this Nov 19, 2025