HDDS-13009. Background snapshot defrag service #9227
…reate empty DB and dump
…ng to read cfOptions from file
…SstFiltered(true); fix obs/fso prefix calc
Pull Request Overview
This PR implements snapshot defragmentation functionality to optimize storage for Ozone snapshots. The defragmentation service processes snapshots in the active chain, creating compacted versions that contain only necessary data for tracked column families.
Key changes:
- Implemented full and incremental defragmentation logic in SnapshotDefragService
- Added support for defragmented snapshot directories alongside regular checkpoint state
- Enhanced database and snapshot utilities to support defragmentation operations
- Added comprehensive integration tests for snapshot defragmentation
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| SnapshotDefragService.java | Implements core defragmentation logic including full and incremental defrag operations, DB integrity verification, and metadata updates |
| SnapshotDiffManager.java | Made getDeltaFiles public and added null check for defrag job tracking |
| OmSnapshotLocalDataManager.java | Sets needsDefrag flag when snapshot versions are modified |
| OmSnapshotManager.java | Updated to pass defrag flag when creating snapshot metadata managers |
| OmMetadataManagerImpl.java | Added support for loading from defragged checkpoint directories |
| RDBStore.java | Creates defraggedSnapshotsParentDir directory structure |
| RocksDatabase.java | Made checkpoint methods public and added proper resource cleanup |
| OzoneConsts.java | Fixed defraggedSnapshotsParentDir constant to include proper path prefix |
| TestSnapshotDefragService.java | Added integration tests for defragmentation functionality |
    SnapshotDiffResponse.SubStatus subStatus) {
  SnapshotDiffJob snapshotDiffJob = snapDiffJobTable.get(jobKey);
  if (snapshotDiffJob == null) {
    // TODO: Record activity for defrag jobs as well somehow
Copilot AI, Oct 31, 2025
The null check for snapshotDiffJob indicates that defrag jobs are not being tracked in the same way as snapshot diff jobs. The TODO suggests this should be addressed. Consider implementing a separate tracking mechanism for defrag jobs or extending the existing job table to handle both job types.
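One way to read the suggestion above is to keep the null check and route unknown job keys to a separate tracker. The following is a minimal sketch under stated assumptions: `JobActivityRecorder`, `diffJobTable`, `untrackedActivity`, and the string-valued sub-status are all hypothetical stand-ins for illustration and are not taken from the PR or from the Ozone codebase.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: none of these names come from the PR.
public class JobActivityRecorder {
  // Stand-in for the snapshot diff job table: jobKey -> last recorded sub-status.
  private final Map<String, String> diffJobTable = new ConcurrentHashMap<>();
  // Assumed separate tracker for jobs that are not in the diff job table,
  // such as defrag jobs: jobKey -> number of recorded activities.
  private final Map<String, AtomicLong> untrackedActivity = new ConcurrentHashMap<>();

  void registerDiffJob(String jobKey) {
    diffJobTable.put(jobKey, "QUEUED");
  }

  /** Records activity; returns true if jobKey belonged to a known diff job. */
  boolean recordActivity(String jobKey, String subStatus) {
    // computeIfPresent returns null when the key is absent (the "null check").
    String updated = diffJobTable.computeIfPresent(jobKey, (k, old) -> subStatus);
    if (updated == null) {
      // Not a diff job: count the activity instead of silently dropping it.
      untrackedActivity.computeIfAbsent(jobKey, k -> new AtomicLong()).incrementAndGet();
      return false;
    }
    return true;
  }

  String diffJobStatus(String jobKey) {
    return diffJobTable.get(jobKey);
  }

  long untrackedCount(String jobKey) {
    AtomicLong count = untrackedActivity.get(jobKey);
    return count == null ? 0L : count.get();
  }
}
```

Whether a separate structure or an extended job table fits better depends on how defrag jobs are keyed, which this sketch does not address.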
…ase-v2 Conflicts: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/OmSnapshotLocalDataManager.java
…l defrag, instead of just return
Added cache invalidation for the snapshot entry before deleting the old database directory to prevent issues with lingering DB handles.
Swapped the order of variables in a debug log statement to display snapshotLocalDataVersion before dbName for improved clarity.
Introduces a static variable for SNAPSHOT_DEFRAG_LIMIT_PER_TASK_VALUE and uses it in configuration and assertions. Also increases the await timeout and refines the defragmentation completion condition for improved test reliability.
jojochuang left a comment:
Test failure looks related:
Error: Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 45.06 s <<< FAILURE! -- in org.apache.hadoop.ozone.om.TestOMDbCheckpointServlet
Error: org.apache.hadoop.ozone.om.TestOMDbCheckpointServlet.testWithoutACL -- Time elapsed: 30.68 s <<< FAILURE!
org.opentest4j.AssertionFailedError: expected snapshot files not found ==> expected: <[db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000058.sst, db.snapshots/diffState/compaction-sst-backup, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000053.sst, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/MANIFEST-000005, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f.yaml, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000063.sst, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/MANIFEST-000005, db.snapshots/diffState/compaction-log/expected.log, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/CURRENT, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000054.sst, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/OPTIONS-000051, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000056.sst, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082, db.snapshots/checkpointStateDefragged, db.snapshots/diffState/compaction-log/_README.txt, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000057.sst, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000053.sst, db.snapshots/diffState/compaction-log, db.snapshots/diffState, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000057.sst, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000061.sst, db.snapshots/diffState/snapDiff/_README.txt, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000061.sst, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000066.sst, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000056.sst, 
db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000065.sst, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000060.sst, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082.yaml, db.snapshots/diffState/compaction-sst-backup/CURRENT, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000058.sst, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000060.sst, db.snapshots/diffState/snapDiff, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/OPTIONS-000051, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000067.sst, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000059.sst, db.snapshots/diffState/compaction-sst-backup/expected.sst, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000054.sst, db.snapshots/checkpointState, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000064.sst, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/CURRENT, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000059.sst]> but was: <[db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000058.sst, db.snapshots/diffState/compaction-sst-backup, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000053.sst, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/MANIFEST-000005, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f.yaml, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000063.sst, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/MANIFEST-000005, db.snapshots/diffState/compaction-log/expected.log, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/CURRENT, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000054.sst, 
db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/OPTIONS-000051, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000056.sst, db.snapshots/diffState/compaction-log/_README.txt, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000057.sst, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000053.sst, db.snapshots/diffState/compaction-log, db.snapshots/diffState, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000057.sst, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000061.sst, db.snapshots/diffState/snapDiff/_README.txt, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000066.sst, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000061.sst, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000056.sst, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000065.sst, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000060.sst, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082.yaml, db.snapshots/diffState/compaction-sst-backup/CURRENT, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000060.sst, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000058.sst, db.snapshots/diffState/snapDiff, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/OPTIONS-000051, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000067.sst, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000059.sst, db.snapshots/diffState/compaction-sst-backup/expected.sst, db.snapshots/checkpointState, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/000054.sst, 
db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000064.sst, db.snapshots/checkpointState/om.db-34a4a5d7-a672-4024-8f63-53248c82e94f/CURRENT, db.snapshots/checkpointState/om.db-5f94a3d6-08e4-4e0e-a7d5-a61c4e19b082/000059.sst]>
at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:182)
at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1156)
at org.apache.hadoop.ozone.om.TestOMDbCheckpointServlet.testWriteDbDataToStream(TestOMDbCheckpointServlet.java:502)
at org.apache.hadoop.ozone.om.TestOMDbCheckpointServlet.testWithoutACL(TestOMDbCheckpointServlet.java:252)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
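Comparing the two file lists in the assertion message by eye is error-prone. In the output above, `db.snapshots/checkpointStateDefragged` appears in the expected list but not in the actual one. A set difference makes such gaps explicit. The sketch below is a standalone illustration, not part of the test class: the class name `SnapshotFileSetDiff` and the helper `difference` are hypothetical, and the sample sets contain only a few of the paths from the log.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of TestOMDbCheckpointServlet.
public class SnapshotFileSetDiff {
  /** Returns the elements of left that are absent from right, in sorted order. */
  static Set<String> difference(Set<String> left, Set<String> right) {
    Set<String> result = new TreeSet<>(left);
    result.removeAll(right);
    return result;
  }

  public static void main(String[] args) {
    // Abbreviated subsets of the paths shown in the failure message.
    Set<String> expected = Set.of(
        "db.snapshots/checkpointState",
        "db.snapshots/checkpointStateDefragged",
        "db.snapshots/diffState");
    Set<String> actual = Set.of(
        "db.snapshots/checkpointState",
        "db.snapshots/diffState");

    System.out.println("Missing from actual: " + difference(expected, actual));
    System.out.println("Unexpected in actual: " + difference(actual, expected));
  }
}
```

Printing both directions of the difference shows whether the mismatch comes from a missing entry, an extra entry, or both.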
Note #9208 may have introduced a conflict with this PR in how the defrag DB path (directory location) is handled.
Ah. It could be related to the change in
Closing this in favor of #9324
What changes were proposed in this pull request?
Implement Background Snapshot Defragmentation Service outlined in the design
Some commits are cherry-picked from the POC and rebased/changed.
Based on draft v1 #9117, #9133 and various improvements. Rebased and addressed most comments in the draft v1.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13009
How was this patch tested?