From 5cd218d1be5fedb157aca6d35c5be256f615fb07 Mon Sep 17 00:00:00 2001 From: Swaminathan Balachandran Date: Tue, 27 May 2025 14:22:17 -0400 Subject: [PATCH 1/7] HDDS-13003. [Design Doc] Snapshot Compaction to reduce storage footprint Change-Id: Ieb6a1145c732ffbbbc6811565734a78bd12e30ef --- .../content/feature/SnapshotCompaction.md | 87 +++++++++++++++++++ 1 file changed, 87 insertions(+) create mode 100644 hadoop-hdds/docs/content/feature/SnapshotCompaction.md diff --git a/hadoop-hdds/docs/content/feature/SnapshotCompaction.md b/hadoop-hdds/docs/content/feature/SnapshotCompaction.md new file mode 100644 index 000000000000..5033782aaeb3 --- /dev/null +++ b/hadoop-hdds/docs/content/feature/SnapshotCompaction.md @@ -0,0 +1,87 @@ +# Improving Snapshot Scale: + +[HDDS-13003](https://issues.apache.org/jira/browse/HDDS-13003) + +# Problem Statement + +In Apache Ozone, snapshots currently take a checkpoint of the Active Object Store (AOS) RocksDB each time a snapshot is created and track the compaction of SST files over time. This model works efficiently when snapshots are short-lived, as they merely serve as hard links to the AOS RocksDB. However, over time, if an older snapshot persists while significant churn occurs in the AOS RocksDB (due to compactions and writes), the snapshot RocksDB may diverge significantly from both the AOS RocksDB and other snapshot RocksDB instances. This divergence increases storage requirements linearly with the number of snapshots. + +# Solution Proposal: + +The primary inefficiency in the current snapshotting mechanism stems from constant RocksDB compactions in AOS, which can cause a key, file, or directory entry to appear in multiple SST files. Ideally, each unique key, file, or directory entry should reside in only one SST file, eliminating redundant storage and mitigating the multiplier effect caused by snapshots. 
If implemented correctly, the total RocksDB size would be proportional to the total number of unique keys in the system rather than the number of snapshots. + +## Snapshot Compaction: + +Currently, automatic RocksDB compactions are disabled for snapshot RocksDB to preserve snapshot diff performance, preventing any form of compaction. However, snapshots can be compacted if the next snapshot in the chain is a checkpoint of the previous snapshot plus a diff stored in a separate SST file. The proposed approach involves rewriting snapshots iteratively from the beginning of the snapshot chain and restructuring them in a separate directory. P.S This has got nothing to do with compacting snapshot’s rocksdb, we are not going to enable rocksdb auto compaction on snapshot rocksdb. + +1. ### Introducing a last compaction time: + + A new boolean flag (`needsCompaction`) and timestamp (`lastCompactionTime`) will be added to snapshot metadata. If absent, `needsCompaction` will default to `true`. + A new list of Map\\> (`sstFiles`) also needs to be added to snapshot info; this would be storing the original list of sst files in the uncompacted copy of the snapshot corresponding to keyTable/fileTable/DirectoryTable. + Since this is not going to be consistent across all OMs this would have to be written to a local yaml file inside the snapshot directory and this can be maintained in the SnapshotChainManager in memory on startup. So all updates should not go via ratis. + +2. ### Snapshot Cache Lock for Read Prevention + + A snapshot lock will be introduced in the snapshot cache to prevent reads on a specific snapshot during compaction. This ensures no active reads occur while replacing the underlying RocksDB instance. + +3. ### Directory Structure Changes + + Snapshots currently reside in the `db.checkpoints` directory. The proposal introduces a `db.checkpoint.compacted` directory for compacted snapshots. The directory format should be as follows: + +| om.db-\.\ | +| :---- | + +4. 
### Optimized Snapshot Diff Computation: + +To compute a snapshot diff: + +* If both snapshots are compacted, their compacted versions will be used. The diff b/w two compacted snapshot should be present in one sst file. +* If the target snapshot is uncompacted & the source snapshot is compacted(other way is not possible as we always compact snapshots in order) and if the DAG has all the sst files corresponding to the uncompacted snapshot version of the compacted snapshot which would be captured as part of the snapshot metadata once a snapshot is compacted, then an efficient diff can be performed with the information present in the DAG. +* Otherwise, a full diff will be computed between the compacted source and the uncompacted target snapshot. +* Changes in the full diff logic is required to check inode ids of sst files and remove the common sst files b/w source and target snapshots. + + +5. ### Snapshot Compaction Workflow + + Snapshot compaction should only occur once the snapshot has undergone SST filtering. The following steps outline the process: +1. **Create a RocksDB checkpoint** of the path previous snapshot corresponding to the bucket in the chain (if it exists). If a compacted copy already exists, update `compactionTime` and set `needsCompaction` to `false`. This checkpoint can be created in a tmp directory. +2. **Acquire the `SNAPSHOT_GC_LOCK`** for the snapshot ID to prevent garbage collection during compaction. +3. **Compute the diff** between tables (`keyTable`, `fileTable`, `directoryTable`) of the checkpoint and the current snapshot using snapshot diff functionality. +4. **Flush changed objects** into separate SST files using the SST file writer, categorizing them by table type. +5. **Ingest these SST files** into the RocksDB checkpoint using the `ingestFile` API. +6. Truncate `deletedTable,deletedDirectoryTable,snapshotRenamedTable etc. 
(All tables excepting keyTable/fileTable/directoryTable)` in checkpointed rocksdb and ingest the entire table from deletedTable and deletedDirectoryTable from the current snapshot rocksdb. +7. **Acquire the snapshot cache lock** to prevent snapshot access during directory updates. +8. **Move the checkpoint directory** into `db.checkpoint.compacted` with the format: + +| om.db-\.\ | +| :---- | + +9. **Update snapshot metadata**, setting `lastCompactionTime` and marking `needsCompaction = false` and set the next snapshot in the chain is marked for compaction. The `sstFiles` is set by creating Map\\> from the uncompacted version of the snapshot and this is only set once i.e. `lastCompactionTime` should be zero. +10. **Delete old uncompacted/compacted snapshots**, ensuring unreferenced uncompacted/compacted snapshots are purged during OM startup(This is to handle jvm crash after viii). +11. **Release the snapshot cache lock** on the snapshot id. Now the snapshot is ready to be used to read. + + + +6. ### Computing Changed Objects Between Snapshots + + The following steps outline how to compute changed objects: +1. **Determine delta SST files**: + * Retrieve from DAG if the snapshot was uncompacted previously and the previous snapshot has an uncompacted copy. + * Otherwise, compute delta SST files by comparing SST files in both compacted RocksDBs. +2. **Initialize SST file writers** for `keyTable`, `directoryTable`, and `fileTable`. +3. **Iterate SST files in parallel**, reading and merging keys to maintain sorted order.(Similar to the MinHeapIterator instead of iterating through multiple tables we would be iterating through multiple sst files concurrently). +4. **Compare keys** between snapshots to determine changes and write updated objects if and only if they have changed into the SST file. + * If the object is present in the target snapshot then do an sstFileWriter.put(). 
+ * If the object is present in source snapshot but not present in target snapshot then we just have to write a tombstone entry by calling sstFileWriter.delete(). +5. **Ingest these SST files** into the checkpointed RocksDB. + +7. ### Handling Snapshot Purge + + Upon snapshot deletion, the `needsCompaction` flag for the next snapshot in the chain is set to `true`, ensuring compaction propagates incrementally across the snapshot chain. + +# Conclusion + +This approach effectively reduces storage overhead while maintaining efficient snapshot retrieval and diff computation. The total storage would be in the order of total number of keys in the snapshots \+ AOS by reducing overall redundancy of the objects while also making the snapshot diff computation for even older snapshots more computationally efficient. + + + From 17573a67cc35d0841fe6c72a848fb3b8763732af Mon Sep 17 00:00:00 2001 From: Swaminathan Balachandran Date: Thu, 29 May 2025 21:15:20 -0400 Subject: [PATCH 2/7] HDDS-13003. Add force manual compaction for first snapshot in chain Change-Id: I9761006b9b9697f8392aab68c01b400793996d06 --- .../content/feature/SnapshotCompaction.md | 28 +++++++++---------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/hadoop-hdds/docs/content/feature/SnapshotCompaction.md b/hadoop-hdds/docs/content/feature/SnapshotCompaction.md index 5033782aaeb3..18455634dbc8 100644 --- a/hadoop-hdds/docs/content/feature/SnapshotCompaction.md +++ b/hadoop-hdds/docs/content/feature/SnapshotCompaction.md @@ -26,7 +26,7 @@ Currently, automatic RocksDB compactions are disabled for snapshot RocksDB to pr 3. ### Directory Structure Changes - Snapshots currently reside in the `db.checkpoints` directory. The proposal introduces a `db.checkpoint.compacted` directory for compacted snapshots. The directory format should be as follows: + Snapshots currently reside in the `db.checkpoints` directory. The proposal introduces a `db.checkpoints.compacted` directory for compacted snapshots. 
The directory format should be as follows: | om.db-\.\ | | :---- | @@ -44,21 +44,24 @@ To compute a snapshot diff: 5. ### Snapshot Compaction Workflow Snapshot compaction should only occur once the snapshot has undergone SST filtering. The following steps outline the process: -1. **Create a RocksDB checkpoint** of the path previous snapshot corresponding to the bucket in the chain (if it exists). If a compacted copy already exists, update `compactionTime` and set `needsCompaction` to `false`. This checkpoint can be created in a tmp directory. +1. **Create a RocksDB checkpoint** of the path previous snapshot corresponding to the bucket in the chain (if it exists). 2. **Acquire the `SNAPSHOT_GC_LOCK`** for the snapshot ID to prevent garbage collection during compaction. -3. **Compute the diff** between tables (`keyTable`, `fileTable`, `directoryTable`) of the checkpoint and the current snapshot using snapshot diff functionality. -4. **Flush changed objects** into separate SST files using the SST file writer, categorizing them by table type. -5. **Ingest these SST files** into the RocksDB checkpoint using the `ingestFile` API. -6. Truncate `deletedTable,deletedDirectoryTable,snapshotRenamedTable etc. (All tables excepting keyTable/fileTable/directoryTable)` in checkpointed rocksdb and ingest the entire table from deletedTable and deletedDirectoryTable from the current snapshot rocksdb. -7. **Acquire the snapshot cache lock** to prevent snapshot access during directory updates. -8. **Move the checkpoint directory** into `db.checkpoint.compacted` with the format: + 1. If there is no path previous snapshot then + 1. take a checkpoint of the same rocksdb instance remove keys that don’t correspond to the bucket from tables `keyTable`, `fileTable`, `directoryTable,deletedTable,deletedDirectoryTable` by running rocksdb delete range api. 
We can trigger a forced manual compaction on the rocksdb instance(This can be behind a flag wherein the process can just work with the checkpoint of the rocksdb if the flag is disabled and not perform manual compaction). This should be done if the snapshot has never been compacted before i.e. if `lastCompactionTime` is zero or null. Otherwise just update the `needsCompaction` to False. + 2. If path previous snapshot exists: + 1. **Compute the diff** between tables (`keyTable`, `fileTable`, `directoryTable`) of the checkpoint and the current snapshot using snapshot diff functionality. + 2. **Flush changed objects** into separate SST files using the SST file writer, categorizing them by table type. + 3. **Ingest these SST files** into the RocksDB checkpoint using the `ingestFile` API. +3. Truncate `deletedTable,deletedDirectoryTable,snapshotRenamedTable etc. (All tables excepting keyTable/fileTable/directoryTable)` in checkpointed rocksdb and ingest the entire table from deletedTable and deletedDirectoryTable from the current snapshot rocksdb. +4. **Acquire the snapshot cache lock** to prevent snapshot access during directory updates. +5. **Move the checkpoint directory** into `db.checkpoint.compacted` with the format: | om.db-\.\ | | :---- | -9. **Update snapshot metadata**, setting `lastCompactionTime` and marking `needsCompaction = false` and set the next snapshot in the chain is marked for compaction. The `sstFiles` is set by creating Map\\> from the uncompacted version of the snapshot and this is only set once i.e. `lastCompactionTime` should be zero. -10. **Delete old uncompacted/compacted snapshots**, ensuring unreferenced uncompacted/compacted snapshots are purged during OM startup(This is to handle jvm crash after viii). -11. **Release the snapshot cache lock** on the snapshot id. Now the snapshot is ready to be used to read. +6. 
**Update snapshot metadata**, setting `lastCompactionTime` and marking `needsCompaction = false` and set the next snapshot in the chain is marked for compaction. The `sstFiles` is set by creating Map\\> from the uncompacted version of the snapshot and this is only set once i.e. `lastCompactionTime` should be zero. +7. **Delete old uncompacted/compacted snapshots**, ensuring unreferenced uncompacted/compacted snapshots are purged during OM startup(This is to handle jvm crash after viii). +8. **Release the snapshot cache lock** on the snapshot id. Now the snapshot is ready to be used to read. @@ -82,6 +85,3 @@ To compute a snapshot diff: # Conclusion This approach effectively reduces storage overhead while maintaining efficient snapshot retrieval and diff computation. The total storage would be in the order of total number of keys in the snapshots \+ AOS by reducing overall redundancy of the objects while also making the snapshot diff computation for even older snapshots more computationally efficient. - - - From 421373a99e87335c61e5ff632e59b6565309e072 Mon Sep 17 00:00:00 2001 From: Swaminathan Balachandran Date: Fri, 6 Jun 2025 18:44:02 -0400 Subject: [PATCH 3/7] Update SnapshotCompaction.md --- .../content/feature/SnapshotCompaction.md | 76 ++++++++++--------- 1 file changed, 41 insertions(+), 35 deletions(-) diff --git a/hadoop-hdds/docs/content/feature/SnapshotCompaction.md b/hadoop-hdds/docs/content/feature/SnapshotCompaction.md index 18455634dbc8..f70bbf8eac39 100644 --- a/hadoop-hdds/docs/content/feature/SnapshotCompaction.md +++ b/hadoop-hdds/docs/content/feature/SnapshotCompaction.md @@ -16,9 +16,10 @@ Currently, automatic RocksDB compactions are disabled for snapshot RocksDB to pr 1. ### Introducing a last compaction time: - A new boolean flag (`needsCompaction`) and timestamp (`lastCompactionTime`) will be added to snapshot metadata. If absent, `needsCompaction` will default to `true`. 
- A new list of Map\\> (`sstFiles`) also needs to be added to snapshot info; this would be storing the original list of sst files in the uncompacted copy of the snapshot corresponding to keyTable/fileTable/DirectoryTable. - Since this is not going to be consistent across all OMs this would have to be written to a local yaml file inside the snapshot directory and this can be maintained in the SnapshotChainManager in memory on startup. So all updates should not go via ratis. + A new boolean flag (`needsCompaction`), timestamp (`lastCompactionTime`), int `version` will be added to snapshot metadata. If absent, `needsCompaction` will default to `true`. + A new list of Map\\> (`uncompactedSstFiles`) also needs to be added to snapshot meta as part of snapshot create operation; this would be storing the original list of sst files in the uncompacted copy of the snapshot corresponding to keyTable/fileTable/DirectoryTable. This should be done as part of the snapshot create operation. + Since this is not going to be consistent across all OMs this would have to be written to a local yaml file inside the snapshot directory and this can be maintained in the SnapshotChainManager in memory on startup. So all updates should not go via ratis. + An additional Map\\>\> (`compactedSstFiles`) also needs to be added to snapshotMeta. This will be maintaining a list of sstFiles of different versions of compacted snapshots. The key here would be the version number of snapshots. 2. ### Snapshot Cache Lock for Read Prevention @@ -28,54 +29,56 @@ Currently, automatic RocksDB compactions are disabled for snapshot RocksDB to pr Snapshots currently reside in the `db.checkpoints` directory. The proposal introduces a `db.checkpoints.compacted` directory for compacted snapshots. The directory format should be as follows: -| om.db-\.\ | +| om.db-\.\ | | :---- | 4. 
### Optimized Snapshot Diff Computation: To compute a snapshot diff: -* If both snapshots are compacted, their compacted versions will be used. The diff b/w two compacted snapshot should be present in one sst file. -* If the target snapshot is uncompacted & the source snapshot is compacted(other way is not possible as we always compact snapshots in order) and if the DAG has all the sst files corresponding to the uncompacted snapshot version of the compacted snapshot which would be captured as part of the snapshot metadata once a snapshot is compacted, then an efficient diff can be performed with the information present in the DAG. -* Otherwise, a full diff will be computed between the compacted source and the uncompacted target snapshot. -* Changes in the full diff logic is required to check inode ids of sst files and remove the common sst files b/w source and target snapshots. +* If both snapshots are compacted, their compacted versions will be used. The diff b/w two compacted snapshot should be present in one sst file. +* If the target snapshot is uncompacted & the source snapshot is compacted(other way is not possible as we always compact snapshots in order) and if the DAG has all the sst files corresponding to the uncompacted snapshot version of the compacted snapshot which would be captured as part of the snapshot metadata, then an efficient diff can be performed with the information present in the DAG. Use `uncompactedSstFiles` from each of the snapshot’s meta +* Otherwise, a full diff will be computed between the compacted source and the compacted target snapshot. Delta sst files would be computed corresponding to the latest version number of the target snapshot(version number of target snapshot would always be greater) +* Changes in the full diff logic is required to check inode ids of sst files and remove the common sst files b/w source and target snapshots. 5. 
### Snapshot Compaction Workflow - Snapshot compaction should only occur once the snapshot has undergone SST filtering. The following steps outline the process: -1. **Create a RocksDB checkpoint** of the path previous snapshot corresponding to the bucket in the chain (if it exists). -2. **Acquire the `SNAPSHOT_GC_LOCK`** for the snapshot ID to prevent garbage collection during compaction. - 1. If there is no path previous snapshot then - 1. take a checkpoint of the same rocksdb instance remove keys that don’t correspond to the bucket from tables `keyTable`, `fileTable`, `directoryTable,deletedTable,deletedDirectoryTable` by running rocksdb delete range api. We can trigger a forced manual compaction on the rocksdb instance(This can be behind a flag wherein the process can just work with the checkpoint of the rocksdb if the flag is disabled and not perform manual compaction). This should be done if the snapshot has never been compacted before i.e. if `lastCompactionTime` is zero or null. Otherwise just update the `needsCompaction` to False. - 2. If path previous snapshot exists: - 1. **Compute the diff** between tables (`keyTable`, `fileTable`, `directoryTable`) of the checkpoint and the current snapshot using snapshot diff functionality. - 2. **Flush changed objects** into separate SST files using the SST file writer, categorizing them by table type. - 3. **Ingest these SST files** into the RocksDB checkpoint using the `ingestFile` API. -3. Truncate `deletedTable,deletedDirectoryTable,snapshotRenamedTable etc. (All tables excepting keyTable/fileTable/directoryTable)` in checkpointed rocksdb and ingest the entire table from deletedTable and deletedDirectoryTable from the current snapshot rocksdb. -4. **Acquire the snapshot cache lock** to prevent snapshot access during directory updates. -5. 
**Move the checkpoint directory** into `db.checkpoint.compacted` with the format: 
- 
-| om.db-\.\ | 
+   A background snapshot compaction service should be added that iterates through the snapshot chain in the same order as the global snapshot chain. This ensures a snapshot is compacted only after every snapshot created before it has been compacted. Snapshot compaction should only occur once the snapshot has undergone SST filtering. The following steps outline the process: 
+1. **Create a RocksDB checkpoint** of the path previous snapshot corresponding to the bucket in the chain (if it exists). The previous snapshot's `version` should be strictly greater than the current snapshot's `version`; otherwise, skip compacting this snapshot in this iteration. 
+2. **Acquire the `SNAPSHOT_GC_LOCK`** for the snapshot ID to prevent garbage collection during compaction. \[This keeps the contents of the deleted tables consistent while compaction is in progress.\] 
+    1. If there is no path previous snapshot: 
+        1. Take a checkpoint of the same RocksDB instance and remove the keys that do not correspond to the bucket from the `keyTable`, `fileTable`, `directoryTable`, `deletedTable` and `deletedDirectoryTable` tables by running the RocksDB delete-range API. This should be done only if the snapshot has never been compacted before, i.e. if `lastCompactionTime` is zero or null. Otherwise, just update `needsCompaction` to `false`. 
+        2. We can then trigger a forced manual compaction on the RocksDB instance (steps i and ii can be behind a flag, wherein we just work with the checkpoint of the RocksDB if the flag is disabled). 
+    2. If a path previous snapshot exists: 
+        1. **Compute the diff** between the tables (`keyTable`, `fileTable`, `directoryTable`) of the checkpoint and the current snapshot using the snapshot diff functionality. 
+        2. **Flush changed objects** into separate SST files using the SST file writer, categorizing them by table type. 
+        3. 
**Ingest these SST files** into the RocksDB checkpoint using the `ingestFile` API. 
+3. Check whether the entire current snapshot has been flushed to disk; otherwise, wait for the flush to complete. 
+4. Truncate the `deletedTable`, `deletedDirectoryTable`, `snapshotRenamedTable`, etc. (all tables except `keyTable`/`fileTable`/`directoryTable`) in the checkpointed RocksDB, and ingest the entire `deletedTable` and `deletedDirectoryTable` from the current snapshot's RocksDB. 
+5. **Acquire the snapshot cache lock** to prevent snapshot access during directory updates. \[While performing the snapshot RocksDB directory switch there should be no RocksDB handle with reads happening on it.\] 
+6. **Move the checkpoint directory** into `db.checkpoints.compacted` with the format: 
+ 
+| om.db-\.\ | 
 | :---- | 
 
-6. **Update snapshot metadata**, setting `lastCompactionTime` and marking `needsCompaction = false` and set the next snapshot in the chain is marked for compaction. The `sstFiles` is set by creating Map\\> from the uncompacted version of the snapshot and this is only set once i.e. `lastCompactionTime` should be zero. 
-7. **Delete old uncompacted/compacted snapshots**, ensuring unreferenced uncompacted/compacted snapshots are purged during OM startup(This is to handle jvm crash after viii). 
-8. **Release the snapshot cache lock** on the snapshot id. Now the snapshot is ready to be used to read. 
- 
+7. **Update snapshot metadata**: set `lastCompactionTime`, mark `needsCompaction = false`, and mark the next snapshot in the chain for compaction. If there is no path previous snapshot in the chain, increase `version` by 1; otherwise, set `version` equal to the previous snapshot's `version`. Based on the SST files in the RocksDB, compute the map of table name to SST file list and add it to `compactedSstFiles` under the snapshot's `version`. 
+8. 
**Delete old uncompacted/compacted snapshots**, ensuring unreferenced uncompacted/compacted snapshots are purged during OM startup (this handles a JVM crash occurring after the directory move). 
+9. **Release the snapshot cache lock** on the snapshot ID. The snapshot is now ready to serve reads. 
+ 
 
 6. ### Computing Changed Objects Between Snapshots 
 
-   The following steps outline how to compute changed objects: 
-1. **Determine delta SST files**: 
-   * Retrieve from DAG if the snapshot was uncompacted previously and the previous snapshot has an uncompacted copy. 
-   * Otherwise, compute delta SST files by comparing SST files in both compacted RocksDBs. 
-2. **Initialize SST file writers** for `keyTable`, `directoryTable`, and `fileTable`. 
-3. **Iterate SST files in parallel**, reading and merging keys to maintain sorted order.(Similar to the MinHeapIterator instead of iterating through multiple tables we would be iterating through multiple sst files concurrently). 
-4. **Compare keys** between snapshots to determine changes and write updated objects if and only if they have changed into the SST file. 
-   * If the object is present in the target snapshot then do an sstFileWriter.put(). 
-   * If the object is present in source snapshot but not present in target snapshot then we just have to write a tombstone entry by calling sstFileWriter.delete(). 
+   The following steps outline how to compute changed objects: 
+1. **Determine the delta SST files**: 
+   * Retrieve them from the DAG if the snapshot was previously uncompacted and the previous snapshot has an uncompacted copy. 
+   * Otherwise, compute the delta SST files by comparing the SST files in both compacted RocksDBs. 
+2. **Initialize SST file writers** for `keyTable`, `directoryTable`, and `fileTable`. 
+3. **Iterate SST files in parallel**, reading and merging keys to maintain sorted order (similar to the MinHeapIterator, but iterating through multiple SST files concurrently instead of through multiple tables). 
+4. 
**Compare keys** between snapshots to determine changes and write updated objects if and only if they have changed into the SST file. + * If the object is present in the target snapshot then do an sstFileWriter.put(). + * If the object is present in source snapshot but not present in target snapshot then we just have to write a tombstone entry by calling sstFileWriter.delete(). 5. **Ingest these SST files** into the checkpointed RocksDB. 7. ### Handling Snapshot Purge @@ -85,3 +88,6 @@ To compute a snapshot diff: # Conclusion This approach effectively reduces storage overhead while maintaining efficient snapshot retrieval and diff computation. The total storage would be in the order of total number of keys in the snapshots \+ AOS by reducing overall redundancy of the objects while also making the snapshot diff computation for even older snapshots more computationally efficient. + + + From 8f0b32971c5489df8c7fd56f797efdfcb5d0ba9c Mon Sep 17 00:00:00 2001 From: Siyao Meng <50227127+smengcl@users.noreply.github.com> Date: Mon, 28 Jul 2025 11:06:08 -0700 Subject: [PATCH 4/7] Apply suggestion from @jojochuang Co-authored-by: Wei-Chiu Chuang --- .../content/feature/SnapshotCompaction.md | 24 +++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/hadoop-hdds/docs/content/feature/SnapshotCompaction.md b/hadoop-hdds/docs/content/feature/SnapshotCompaction.md index f70bbf8eac39..793b6085016f 100644 --- a/hadoop-hdds/docs/content/feature/SnapshotCompaction.md +++ b/hadoop-hdds/docs/content/feature/SnapshotCompaction.md @@ -1,3 +1,27 @@ +--- +title: "Improve Snapshot Compaction Scale" +weight: 1 +menu: + main: + parent: Features +summary: Reduce the disk usage occupied by Ozone Snapshot metadata. 
+--- + # Improving Snapshot Scale: [HDDS-13003](https://issues.apache.org/jira/browse/HDDS-13003) From c12e519dfd62a9be891b75317f90feb8aaeb4ae0 Mon Sep 17 00:00:00 2001 From: Siyao Meng <50227127+smengcl@users.noreply.github.com> Date: Fri, 22 Aug 2025 19:56:15 -0700 Subject: [PATCH 5/7] Change term: Snapshot Compaction -> Snapshot Defragmentation --- .../content/feature/SnapshotCompaction.md | 72 +++++++++---------- 1 file changed, 35 insertions(+), 37 deletions(-) diff --git a/hadoop-hdds/docs/content/feature/SnapshotCompaction.md b/hadoop-hdds/docs/content/feature/SnapshotCompaction.md index 793b6085016f..dfaf8fe0c776 100644 --- a/hadoop-hdds/docs/content/feature/SnapshotCompaction.md +++ b/hadoop-hdds/docs/content/feature/SnapshotCompaction.md @@ -1,5 +1,5 @@ --- -title: "Improve Snapshot Compaction Scale" +title: "Improve Ozone Snapshot Scale with Snapshot Defragmentation" weight: 1 menu: main: @@ -22,7 +22,7 @@ summary: Reduce the disk usage occupied by Ozone Snapshot metadata. See the License for the specific language governing permissions and limitations under the License. --> -# Improving Snapshot Scale: +# Improving Snapshot Scale [HDDS-13003](https://issues.apache.org/jira/browse/HDDS-13003) @@ -30,74 +30,75 @@ summary: Reduce the disk usage occupied by Ozone Snapshot metadata. In Apache Ozone, snapshots currently take a checkpoint of the Active Object Store (AOS) RocksDB each time a snapshot is created and track the compaction of SST files over time. This model works efficiently when snapshots are short-lived, as they merely serve as hard links to the AOS RocksDB. However, over time, if an older snapshot persists while significant churn occurs in the AOS RocksDB (due to compactions and writes), the snapshot RocksDB may diverge significantly from both the AOS RocksDB and other snapshot RocksDB instances. This divergence increases storage requirements linearly with the number of snapshots. 
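To make the linear-growth concern in the paragraph above concrete, here is a small illustrative model (an editor's sketch only, not part of the design; the sizes are made-up assumptions):

```python
# Illustrative sketch only (not from the design): compares metadata storage
# growth with fully diverged snapshots vs. a defragmented snapshot chain.
# Hypothetical numbers: 100 GiB of unique key/file/directory entries, and
# roughly 1 GiB of entries actually changing between consecutive snapshots.

GIB = 2**30

def diverged_snapshots_bytes(unique_bytes, num_snapshots):
    """Worst case today: AOS churn leaves each old snapshot sharing no SST
    files with the AOS RocksDB, so storage grows linearly with snapshots."""
    return unique_bytes * (1 + num_snapshots)

def defragmented_chain_bytes(unique_bytes, delta_bytes, num_snapshots):
    """Proposed: one base copy plus a small per-snapshot delta SST file."""
    return unique_bytes + delta_bytes * num_snapshots

print(diverged_snapshots_bytes(100 * GIB, 10) // GIB)           # 1100
print(defragmented_chain_bytes(100 * GIB, 1 * GIB, 10) // GIB)  # 110
```

Under these assumed numbers, ten retained snapshots cost roughly 11x the unique-key footprint today, versus about 1.1x with a defragmented chain, which is the "proportional to unique keys, not snapshot count" property the proposal targets.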
-# Solution Proposal: 
+# Solution Proposal 
 
 The primary inefficiency in the current snapshotting mechanism stems from constant RocksDB compactions in AOS, which can cause a key, file, or directory entry to appear in multiple SST files. Ideally, each unique key, file, or directory entry should reside in only one SST file, eliminating redundant storage and mitigating the multiplier effect caused by snapshots. 
 If implemented correctly, the total RocksDB size would be proportional to the total number of unique keys in the system rather than the number of snapshots. 
 
-## Snapshot Compaction: 
+## Snapshot Defragmentation 
 
-Currently, automatic RocksDB compactions are disabled for snapshot RocksDB to preserve snapshot diff performance, preventing any form of compaction. However, snapshots can be compacted if the next snapshot in the chain is a checkpoint of the previous snapshot plus a diff stored in a separate SST file. The proposed approach involves rewriting snapshots iteratively from the beginning of the snapshot chain and restructuring them in a separate directory. P.S This has got nothing to do with compacting snapshot’s rocksdb, we are not going to enable rocksdb auto compaction on snapshot rocksdb. 
+Currently, automatic RocksDB compactions are disabled for snapshot RocksDB to preserve snapshot diff performance, preventing any form of compaction. However, snapshots can be defragmented so that the next snapshot in the chain becomes a checkpoint of the previous snapshot plus a diff stored in separate SST files (one for each table). The proposed approach rewrites snapshots iteratively from the beginning of the snapshot chain and restructures them in a separate directory. 
+Note: Snapshot Defragmentation was called Snapshot Compaction earlier in the design phase. It is not RocksDB compaction; the feature was renamed to avoid that confusion. 
We are also not going to enable RocksDB auto compaction on snapshot RocksDB. - A new boolean flag (`needsCompaction`), timestamp (`lastCompactionTime`), int `version` will be added to snapshot metadata. If absent, `needsCompaction` will default to `true`. - A new list of Map\\> (`uncompactedSstFiles`) also needs to be added to snapshot meta as part of snapshot create operation; this would be storing the original list of sst files in the uncompacted copy of the snapshot corresponding to keyTable/fileTable/DirectoryTable. This should be done as part of the snapshot create operation. - Since this is not going to be consistent across all OMs this would have to be written to a local yaml file inside the snapshot directory and this can be maintained in the SnapshotChainManager in memory on startup. So all updates should not go via ratis. - An additional Map\\>\> (`compactedSstFiles`) also needs to be added to snapshotMeta. This will be maintaining a list of sstFiles of different versions of compacted snapshots. The key here would be the version number of snapshots. +1. ### Introducing last defragmentation time + + A new boolean flag (`needsDefrag`), timestamp (`lastDefragTime`), int `version` will be added to snapshot metadata. If absent, `needsDefrag` will default to `true`. + A new list of Map\\> (`notDefraggedSstFileList`) also needs to be added to snapshot meta as part of snapshot create operation; this would be storing the original list of sst files in the not defragged copy of the snapshot corresponding to keyTable/fileTable/DirectoryTable. This should be done as part of the snapshot create operation. + Since this is not going to be consistent across all OMs this would have to be written to a local yaml file inside the snapshot directory and this can be maintained in the SnapshotChainManager in memory on startup. So all updates should not go through Ratis. + An additional Map\\>\> (`defraggedSstFileList`) also needs to be added to snapshotMeta. 
This will be maintaining a list of sstFiles of different versions of defragged snapshots. The key here would be the version number of snapshots. 2. ### Snapshot Cache Lock for Read Prevention - A snapshot lock will be introduced in the snapshot cache to prevent reads on a specific snapshot during compaction. This ensures no active reads occur while replacing the underlying RocksDB instance. + A snapshot lock will be introduced in the snapshot cache to prevent reads on a specific snapshot during defragmentation. This ensures no active reads occur while replacing the underlying RocksDB instance. 3. ### Directory Structure Changes - Snapshots currently reside in the `db.checkpoints` directory. The proposal introduces a `db.checkpoints.compacted` directory for compacted snapshots. The directory format should be as follows: + Snapshots currently reside under `db.snapshots/checkpointState/` directory. The proposal introduces a `db.snapshots/checkpointStateDefragged/` directory for defragged snapshots. The directory format should be as follows: -| om.db-\.\ | +| om.db-\-\ | | :---- | -4. ### Optimized Snapshot Diff Computation: +4. ### Optimized Snapshot Diff Computation To compute a snapshot diff: -* If both snapshots are compacted, their compacted versions will be used. The diff b/w two compacted snapshot should be present in one sst file. -* If the target snapshot is uncompacted & the source snapshot is compacted(other way is not possible as we always compact snapshots in order) and if the DAG has all the sst files corresponding to the uncompacted snapshot version of the compacted snapshot which would be captured as part of the snapshot metadata, then an efficient diff can be performed with the information present in the DAG. Use `uncompactedSstFiles` from each of the snapshot’s meta -* Otherwise, a full diff will be computed between the compacted source and the compacted target snapshot. 
Delta sst files would be computed corresponding to the latest version number of the target snapshot(version number of target snapshot would always be greater)
+* If both snapshots are defragged, their defragged versions will be used. The diff between two defragged snapshots should be present in one SST file.
+* If the target snapshot is not defragged and the source snapshot is defragged (the other way around is not possible, as we always defrag snapshots in order), and if the DAG has all the SST files corresponding to the not-defragged version of the defragged snapshot (captured as part of the snapshot metadata), then an efficient diff can be performed with the information present in the DAG. Use `notDefraggedSstFileList` from each snapshot’s meta.
+* Otherwise, a full diff will be computed between the defragged source and the defragged target snapshot. Delta SST files would be computed corresponding to the latest version number of the target snapshot (the version number of the target snapshot would always be greater).
 * Changes in the full diff logic are required to check the inode ids of SST files and remove the SST files common to the source and target snapshots.
-5. ### Snapshot Compaction Workflow
+5. ### Snapshot Defragmentation Workflow

-   A background Snapshot compaction service should be added which would be done by iterating through the snapshot chain in the same order as the global snapshot chain. This is to ensure the snapshot created after is always compacted after all the snapshots previously created are compacted. Snapshot compaction should only occur once the snapshot has undergone SST filtering. The following steps outline the process:
-1. **Create a RocksDB checkpoint** of the path previous snapshot corresponding to the bucket in the chain (if it exists). `version` of previous snapshot should be strictly greater than the current snapshot’s `version` otherwise skip compacting this snapshot in this iteration.
-2. 
**Acquire the `SNAPSHOT_GC_LOCK`** for the snapshot ID to prevent garbage collection during compaction\[This is to keep contents of deleted Table contents same while compaction consistent\].
+   A background snapshot defragmentation service should be added that iterates through the snapshot chain in the same order as the global snapshot chain. This ensures a snapshot created later is always defragged only after all previously created snapshots have been defragged. Snapshot defragmentation should only occur once the snapshot has undergone SST filtering. The following steps outline the process:
+1. **Create a RocksDB checkpoint** of the path previous snapshot corresponding to the bucket in the chain (if it exists). The `version` of the previous snapshot should be strictly greater than the current snapshot’s `version`; otherwise skip defragmenting this snapshot in this iteration.
+2. **Acquire the `SNAPSHOT_GC_LOCK`** for the snapshot ID to prevent garbage collection during defragmentation \[this is to keep the contents of the deleted tables consistent while defragmentation is in progress\].
    1. If there is no path previous snapshot then
-      1. Take a checkpoint of the same rocksdb instance remove keys that don’t correspond to the bucket from tables `keyTable`, `fileTable`, `directoryTable,deletedTable,deletedDirectoryTable` by running rocksdb delete range api. This should be done if the snapshot has never been compacted before i.e. if `lastCompactionTime` is zero or null. Otherwise just update the `needsCompaction` to False.
-      2. We can trigger a forced manual compaction on the rocksdb instance(i & ii can be behind a flag where in we can just work with the checkpoint of the rocksdb if the flag is disabled).
+      1. Take a checkpoint of the same RocksDB instance and remove keys that don’t correspond to the bucket from the tables `keyTable`, `fileTable`, `directoryTable`, `deletedTable`, `deletedDirectoryTable` by running the RocksDB delete range API. 
This should be done only if the snapshot has never been defragged before, i.e. if `lastDefragTime` is zero or null; otherwise just update `needsDefrag` to `false`.
+      2. We can trigger a forced manual compaction on the RocksDB instance (steps i and ii can be behind a flag, in which case we can just work with the checkpoint of the RocksDB when the flag is disabled).
    2. If path previous snapshot exists:
       1. **Compute the diff** between tables (`keyTable`, `fileTable`, `directoryTable`) of the checkpoint and the current snapshot using snapshot diff functionality.
       2. **Flush changed objects** into separate SST files using the SST file writer, categorizing them by table type.
       3. **Ingest these SST files** into the RocksDB checkpoint using the `ingestFile` API.
 3. Check if the entire current snapshot has been flushed to disk otherwise wait for the flush to happen.
-4. Truncate `deletedTable,deletedDirectoryTable,snapshotRenamedTable etc. (All tables excepting keyTable/fileTable/directoryTable)` in checkpointed rocksdb and ingest the entire table from deletedTable and deletedDirectoryTable from the current snapshot rocksdb.
-5. **Acquire the snapshot cache lock** to prevent snapshot access during directory updates.\[While performing the snapshot rocksdb directory switch there should be no rocksdb handle with read happening on it\].
-6. **Move the checkpoint directory** into `db.checkpoint.compacted` with the format:
+4. Truncate `deletedTable`, `deletedDirectoryTable`, `snapshotRenamedTable`, etc. (all tables except keyTable/fileTable/directoryTable) in the checkpointed RocksDB and ingest the entire deletedTable and deletedDirectoryTable contents from the current snapshot RocksDB.
+5. **Acquire the snapshot cache lock** to prevent snapshot access during directory updates \[while performing the snapshot RocksDB directory switch there should be no RocksDB handle with reads happening on it\].
+6. 
**Move the checkpoint directory** into `checkpointStateDefragged` with the format:

-| om.db-\.\ |
+| om.db-\-\ |
 | :---- |

-7. **Update snapshot metadata**, setting `lastCompactionTime` and marking `needsCompaction = false` and set the next snapshot in the chain is marked for compaction. If there is no path previous snapshot in the chain then increase `version` by 1 otherwise set `version` which is equal to the previous snapshot in the chain. Based on the sstFiles in the rocksdb compute Map\\> and add this Map to `compactedSstFiles` corresponding to the `version` of the snapshot.
-8. **Delete old uncompacted/compacted snapshots**, ensuring unreferenced uncompacted/compacted snapshots are purged during OM startup(This is to handle jvm crash after viii).
+7. **Update snapshot metadata**, setting `lastDefragTime`, marking `needsDefrag = false`, and marking the next snapshot in the chain for defragmentation. If there is no path previous snapshot in the chain then increase `version` by 1; otherwise set `version` equal to that of the previous snapshot in the chain. Based on the sstFiles in the RocksDB compute Map\\> and add this Map to `defraggedSstFileList` corresponding to the `version` of the snapshot.
+8. **Delete old not-defragged/defragged snapshots**, ensuring unreferenced not-defragged/defragged snapshots are purged during OM startup (this is to handle a JVM crash after step viii).
 9. **Release the snapshot cache lock** on the snapshot id. Now the snapshot is ready to be used to read.
-
-6. ### Computing Changed Objects Between Snapshots
+### Computing Changed Objects Between Snapshots

 The following steps outline how to compute changed objects:
 1. **Determine delta SST files**:
-   * Retrieve from DAG if the snapshot was uncompacted previously and the previous snapshot has an uncompacted copy.
-   * Otherwise, compute delta SST files by comparing SST files in both compacted RocksDBs.
+   * Retrieve from DAG if the snapshot was not defragged previously and the previous snapshot has a not-defragged copy.
+   * Otherwise, compute delta SST files by comparing SST files in both defragged RocksDBs.
 2. **Initialize SST file writers** for `keyTable`, `directoryTable`, and `fileTable`.
 3. **Iterate SST files in parallel**, reading and merging keys to maintain sorted order (similar to the MinHeapIterator, but instead of iterating through multiple tables we would be iterating through multiple SST files concurrently).
 4. **Compare keys** between snapshots to determine changes, and write objects into the SST file if and only if they have changed.
@@ -107,11 +108,8 @@ To compute a snapshot diff:
 
 7. ### Handling Snapshot Purge
 
-   Upon snapshot deletion, the `needsCompaction` flag for the next snapshot in the chain is set to `true`, ensuring compaction propagates incrementally across the snapshot chain.
+   Upon snapshot deletion, the `needsDefrag` flag for the next snapshot in the chain is set to `true`, ensuring defragmentation propagates incrementally across the snapshot chain.
 
 # Conclusion
 
 This approach effectively reduces storage overhead while maintaining efficient snapshot retrieval and diff computation. The total storage would be in the order of total number of keys in the snapshots \+ AOS by reducing overall redundancy of the objects while also making the snapshot diff computation for even older snapshots more computationally efficient. 
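The compare-and-merge step described under "Computing Changed Objects Between Snapshots" above can be sketched as follows (illustrative Python, not Ozone code: `changed_objects` is a made-up name, plain dicts stand in for the sorted tables, and the comments mark where an SST file writer would be called).

```python
import heapq

# Sketch of the changed-object computation: merge the sorted key streams
# of the previous (checkpoint) and current snapshot, emit a put for keys
# that are new or changed, and a tombstone for keys that disappeared.

def changed_objects(previous, current):
    puts, tombstones = {}, []
    # heapq.merge keeps the combined iteration in sorted key order,
    # mirroring the MinHeapIterator-style concurrent SST file iteration.
    for key in heapq.merge(sorted(previous), sorted(current)):
        if key in current:
            if previous.get(key) != current[key] and key not in puts:
                puts[key] = current[key]      # sstFileWriter.put(...)
        elif key not in tombstones:
            tombstones.append(key)            # sstFileWriter.delete(...)
    return puts, tombstones

prev = {"k1": "v1", "k2": "v2", "k3": "v3"}
curr = {"k1": "v1", "k2": "v2-new", "k4": "v4"}
print(changed_objects(prev, curr))
# -> ({'k2': 'v2-new', 'k4': 'v4'}, ['k3'])
```

Note the "if and only if changed" rule: `k1` is unchanged and produces no output, which is what keeps the ingested delta SST files small.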
- - - From e0d2e6471d50fffa81a22ad5e29155fbb81b6567 Mon Sep 17 00:00:00 2001 From: Siyao Meng <50227127+smengcl@users.noreply.github.com> Date: Fri, 22 Aug 2025 19:56:52 -0700 Subject: [PATCH 6/7] Rename the design doc file itself --- .../feature/{SnapshotCompaction.md => SnapshotDefragmentation.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename hadoop-hdds/docs/content/feature/{SnapshotCompaction.md => SnapshotDefragmentation.md} (100%) diff --git a/hadoop-hdds/docs/content/feature/SnapshotCompaction.md b/hadoop-hdds/docs/content/feature/SnapshotDefragmentation.md similarity index 100% rename from hadoop-hdds/docs/content/feature/SnapshotCompaction.md rename to hadoop-hdds/docs/content/feature/SnapshotDefragmentation.md From e7f846c833760cb428e519a32cd187b020e971b6 Mon Sep 17 00:00:00 2001 From: Siyao Meng <50227127+smengcl@users.noreply.github.com> Date: Fri, 22 Aug 2025 20:19:23 -0700 Subject: [PATCH 7/7] Address Wei-Chiu's comments. --- .../feature/SnapshotDefragmentation.md | 63 +++++++++++++++++-- 1 file changed, 59 insertions(+), 4 deletions(-) diff --git a/hadoop-hdds/docs/content/feature/SnapshotDefragmentation.md b/hadoop-hdds/docs/content/feature/SnapshotDefragmentation.md index dfaf8fe0c776..6d0a3b114e79 100644 --- a/hadoop-hdds/docs/content/feature/SnapshotDefragmentation.md +++ b/hadoop-hdds/docs/content/feature/SnapshotDefragmentation.md @@ -36,20 +36,21 @@ The primary inefficiency in the current snapshotting mechanism stems from consta ## Snapshot Defragmentation -Currently, automatic RocksDB compactions are disabled for snapshot RocksDB to preserve snapshot diff performance, preventing any form of compaction. However, snapshots can be defragmented in the way that the next snapshot in the chain is a checkpoint of the previous snapshot plus a diff stored in separate SST files (one for each table). 
The proposed approach involves rewriting snapshots iteratively from the beginning of the snapshot chain and restructuring them in a separate directory.
+Currently, automatic RocksDB compactions are disabled for snapshot RocksDB to preserve snapshot diff performance, preventing any form of compaction. However, snapshots can be defragmented in such a way that the next active snapshot in the chain is a checkpoint of its previous active snapshot plus a diff stored in separate SST files (one SST for each column family changed). The proposed approach involves rewriting snapshots iteratively from the beginning of the snapshot chain and restructuring them in a separate directory.
 
 Note: Snapshot Defragmentation was previously called Snapshot Compaction earlier during the design phase. It is not RocksDB compaction. Thus the rename to avoid such confusion. We are also not going to enable RocksDB auto compaction on snapshot RocksDB.
 
 1. ### Introducing last defragmentation time
 
-   A new boolean flag (`needsDefrag`), timestamp (`lastDefragTime`), int `version` will be added to snapshot metadata. If absent, `needsDefrag` will default to `true`.
+   A new boolean flag (`needsDefrag`), a timestamp (`lastDefragTime`), and an int `version` will be added to snapshot metadata. If absent, `needsDefrag` will default to `true`.
+   `needsDefrag` tells the system whether a snapshot is pending defragmentation (`true`) or is already defragged and up to date (`false`). This helps manage and automate the defragmentation workflow, ensuring snapshots are efficiently stored and maintained.
    A new list of Map\\> (`notDefraggedSstFileList`) also needs to be added to snapshot meta as part of the snapshot create operation; this would store the original list of SST files in the not-defragged copy of the snapshot corresponding to keyTable/fileTable/DirectoryTable.
Since this is not going to be consistent across all OMs this would have to be written to a local yaml file inside the snapshot directory and this can be maintained in the SnapshotChainManager in memory on startup. So all updates should not go through Ratis. An additional Map\\>\> (`defraggedSstFileList`) also needs to be added to snapshotMeta. This will be maintaining a list of sstFiles of different versions of defragged snapshots. The key here would be the version number of snapshots. 2. ### Snapshot Cache Lock for Read Prevention - A snapshot lock will be introduced in the snapshot cache to prevent reads on a specific snapshot during defragmentation. This ensures no active reads occur while replacing the underlying RocksDB instance. + A snapshot lock will be introduced in the snapshot cache to prevent reads on a specific snapshot during the last step of defragmentation. This ensures no active reads occur while we are replacing the underlying RocksDB instance. The swap should be instantaneous. 3. ### Directory Structure Changes @@ -93,6 +94,32 @@ To compute a snapshot diff: 9. **Release the snapshot cache lock** on the snapshot id. Now the snapshot is ready to be used to read. +#### Visualization + +```mermaid +flowchart TD + A[Start: Not defragged Snapshot Exists] --> B[Has SST Filtering Occurred?] + B -- No --> Z[Wait for SST Filtering] + B -- Yes --> C[Create RocksDB Checkpoint of Previous Snapshot] + C --> D{Defragged Copy Exists?} + D -- Yes --> E[Update defragTime, set needsDefrag=false] + D -- No --> F[Create Checkpoint in Temp Directory] + E --> G[Acquire SNAPSHOT_GC_LOCK] + F --> G + G --> H[Compute Diff between Checkpoint & Current Snapshot] + H --> I[Flush Changed Objects into SST Files by table] + I --> J[Ingest SST Files into Checkpointed RocksDB] + J --> K[Truncate/Replace deletedTable, etc.] 
+ K --> L[Acquire Snapshot Cache Lock] + L --> M[Move Checkpoint Dir to checkpointStateDefragged] + M --> N[Update Snapshot Metadata: lastDefragTime, needsDefrag=false, set next snapshot needsDefrag=true, set sstFiles] + N --> O[Delete old snapshot DB dir] + O --> P[Release Snapshot Cache Lock] + P --> Q[Defragged Snapshot Ready] +``` + + + ### Computing Changed Objects Between Snapshots The following steps outline how to compute changed objects: @@ -106,10 +133,38 @@ To compute a snapshot diff: * If the object is present in source snapshot but not present in target snapshot then we just have to write a tombstone entry by calling sstFileWriter.delete(). 5. **Ingest these SST files** into the checkpointed RocksDB. -7. ### Handling Snapshot Purge +#### Visualization + +```mermaid +flowchart TD + A[Start: Need Diff Between Snapshots] --> B[Determine delta SST files] + B -- DAG Info available --> C[Retrieve from DAG] + B -- Otherwise --> D[Compute delta by comparing SST files in both RocksDBs] + C --> E[Initialize SST file writers: keyTable, directoryTable, fileTable] + D --> E + E --> F[Iterate SST files in parallel, merge keys: MinHeapIterator-like] + F --> G[Compare keys between snapshots] + G --> H{Object in Target?} + H -- Yes --> I[sstFileWriter.put] + H -- No --> J[sstFileWriter.delete tombstone] + I --> K[Ingest SST Files into Checkpointed RocksDB] + J --> K +``` + + +### Handling Snapshot Purge Upon snapshot deletion, the `needsDefrag` flag for the next snapshot in the chain is set to `true`, ensuring defragmentation propagates incrementally across the snapshot chain. +#### Visualization + +```mermaid +flowchart TD + A[Snapshot Deletion Requested] --> B[Set needsDefrag=true for next snapshot in chain] + B --> C[Next snapshots will be defragged incrementally] +``` + + # Conclusion This approach effectively reduces storage overhead while maintaining efficient snapshot retrieval and diff computation. 
The total storage would be on the order of the total number of unique keys in the snapshots \+ AOS, since the overall redundancy of the objects is reduced, while also making snapshot diff computation for even older snapshots more computationally efficient.
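The invariant this document builds on — each defragged snapshot is a checkpoint of its defragged predecessor plus a small ingested delta — can be illustrated with a minimal sketch (illustrative Python, not Ozone code: `defragment_chain` is a hypothetical name and plain dicts stand in for RocksDB instances).

```python
# Sketch of the defragmentation invariant: walk the chain oldest-first,
# rebuild each snapshot as a copy ("checkpoint") of the previous defragged
# snapshot, then apply only the delta of changed/added keys and tombstones
# for deleted keys. Shared data is thus carried forward, not re-stored.

def defragment_chain(snapshots):
    """snapshots: list of full key->value dicts, oldest first.
    Returns (rebuilt state, delta size) per snapshot."""
    defragged, prev = [], {}
    for snap in snapshots:
        checkpoint = dict(prev)                               # checkpoint of previous snapshot
        delta = {k: v for k, v in snap.items() if prev.get(k) != v}
        deletes = [k for k in prev if k not in snap]
        checkpoint.update(delta)                              # ingest delta SST file
        for k in deletes:
            del checkpoint[k]                                 # apply tombstones
        defragged.append((checkpoint, len(delta) + len(deletes)))
        prev = checkpoint
    return defragged

chain = [
    {"a": 1, "b": 2},
    {"a": 1, "b": 3, "c": 4},   # b changed, c added
    {"a": 1, "c": 4},           # b deleted
]
for state, delta_size in defragment_chain(chain):
    print(state, delta_size)
```

Note how the incremental cost of each snapshot equals its churn (2, 2, and 1 entries here) rather than the full DB size, which is the storage bound the conclusion refers to.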