HDDS-13003. [POC] Snapshot Defragmentation to reduce storage footprint #8954
…o anything else for now. need to handle upgrade later. e.g. block defrag service until finalization is performed)
…efrag.limit.per.task`, `ozone.snapshot.defrag.service.interval`.
…n om triggerSnapshotDefrag -host=om`
…le with RDBSstFileWriter then ingest.
Added writeOptions.setDisableWAL(true) to reduce disk writes during testing. This change is intended for test environments only and should not be used in production.
Removes native library includes from rocks-native/pom.xml and adds an antrun plugin in dist/pom.xml to copy native libraries to the Ozone distribution. Updates the shell script to include the new library path in JAVA_LIBRARY_PATH for relevant commands. This improves the packaging and loading of native libraries in Ozone. Rocks tools native lib should not be inside jar apache#123 (-HDDS-11591. Copy dependencies when building each module)
Introduces a debug log message when System.loadLibrary fails, providing more visibility into library loading issues before attempting to load from the jar.
Changed the default values for lastDefragTime to -1L, needsDefragmentation to true, and version to 1 in the OmSnapshotLocalData constructor.
Moved OZONE_OM_SNAPSHOT_DIFF_REPORT_MAX_PAGE_SIZE_DEFAULT to immediately follow its related config key for better organization.
…ocal dev testing Refactored docker-compose.yaml to move volume mounts from the common config to individual services, adding service-specific data volumes and profiles for optional services. Updated docker-config with new Ozone and RocksDB settings, including increased snapshot diff page size, WAL configuration, and snapshot defragmentation interval.
Replaces calls to defragService.start() with defragService.triggerSnapshotDefragOnce() in OzoneManager.
Modified TestSnapshotDefragService2 to use FSO bucket layout. Adjusted configuration parameters for defrag service interval, timeout, and diff report page size. Updated key creation logic for zero-byte keys.
Updated RandomKeyGenerator to catch and log OMException when volumes or buckets already exist, allowing the process to continue instead of failing. Also standardized naming for volumes, buckets, and keys to remove random suffixes.
…ificationTime change as MODIFY instead of RENAME for zero-byte keys
Changed the visibility of the RocksCheckpoint class and its related methods (createCheckpoint and get) from package-private to public. This allows external classes to access and utilize these functionalities.
…g a lot of warning in the new integration test) Added support for the STAND_ALONE replication type in getReplicatedSize. For STAND_ALONE, the replicated size is set equal to the data size.
… others Inserted a TODO comment clarifying that only OM requires the snapshot directory, while SCM, DN, etc. do not.
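One of the commits above adds `STAND_ALONE` handling to `getReplicatedSize`, setting the replicated size equal to the data size. A minimal, self-contained sketch of that rule is shown here. The class, enum, and method signature are hypothetical stand-ins rather than the actual Ozone code, and the `RATIS` branch (data size multiplied by the replication factor) is an assumption included only for contrast.

```java
public class ReplicatedSizeSketch {

  /** Hypothetical stand-in for the replication types discussed in the commit. */
  enum ReplicationType { RATIS, STAND_ALONE }

  /**
   * Returns the space consumed once replication is accounted for.
   * STAND_ALONE: replicated size equals the data size (as the commit describes).
   * RATIS: assumed here to be dataSize * replicationFactor.
   */
  static long replicatedSize(long dataSize, ReplicationType type, int replicationFactor) {
    switch (type) {
      case RATIS:
        return dataSize * replicationFactor;
      case STAND_ALONE:
        return dataSize;
      default:
        throw new IllegalArgumentException("Unsupported replication type: " + type);
    }
  }

  public static void main(String[] args) {
    System.out.println(replicatedSize(1024L, ReplicationType.RATIS, 3));       // prints 3072
    System.out.println(replicatedSize(1024L, ReplicationType.STAND_ALONE, 1)); // prints 1024
  }
}
```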
bugbot run
// Snapshot defragmentation service configuration
public static final String SNAPSHOT_DEFRAG_LIMIT_PER_TASK =
    "ozone.snapshot.defrag.limit.per.task";
public static final int SNAPSHOT_DEFRAG_LIMIT_PER_TASK_DEFAULT = 1;
TODO: the interval can be set to 2 or 3 later
public static final String OZONE_SNAPSHOT_DEFRAG_SERVICE_INTERVAL =
    "ozone.snapshot.defrag.service.interval";
public static final String
    OZONE_SNAPSHOT_DEFRAG_SERVICE_INTERVAL_DEFAULT = "60s";
TODO: May need to increase this interval if snapshotDiff is taking long.
Does the interval countdown start only after the service run ends? I vaguely remember that is the case. Need to double-check.
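For reference, the behavior asked about above ("countdown starts only after the run ends") is what `ScheduledExecutorService.scheduleWithFixedDelay` provides in the JDK, whereas `scheduleAtFixedRate` measures the period from the start of each run. The sketch below only illustrates the fixed-delay semantics. It makes no claim about how the defrag service is actually scheduled, and `IntervalSketch` and `runWithFixedDelay` are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class IntervalSketch {

  /**
   * Runs a task lasting workMillis a given number of times with a fixed delay
   * between the end of one run and the start of the next.
   * Returns the start time of each run (System.nanoTime() values).
   */
  static List<Long> runWithFixedDelay(long workMillis, long delayMillis, int runs) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    List<Long> starts = new CopyOnWriteArrayList<>();
    CountDownLatch done = new CountDownLatch(runs);
    executor.scheduleWithFixedDelay(() -> {
      if (done.getCount() == 0) {
        return; // requested number of runs already completed
      }
      starts.add(System.nanoTime());
      try {
        Thread.sleep(workMillis); // simulate a long-running service pass
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      done.countDown();
    }, 0, delayMillis, TimeUnit.MILLISECONDS);
    try {
      done.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for runs", e);
    } finally {
      executor.shutdownNow();
    }
    return starts;
  }

  public static void main(String[] args) {
    List<Long> starts = runWithFixedDelay(100, 50, 3);
    for (int i = 1; i < starts.size(); i++) {
      long gapMillis = (starts.get(i) - starts.get(i - 1)) / 1_000_000;
      // With fixed delay, each gap is roughly workMillis + delayMillis.
      System.out.println("Gap between run starts: " + gapMillis + " ms");
    }
  }
}
```

If the service behaves this way, a long snapshotDiff stretches the time between runs rather than causing runs to overlap or queue up.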
Closing this one as proper implementations are already being done in other PRs.
WARNING: DO NOT MERGE. This PR is for visibility and comments only. It is too large by itself, and has to be broken down into multiple PRs.
What changes were proposed in this pull request?
Implement Snapshot Defrag service and manual trigger CLI. Design doc: #8514
This working POC contains an extremely crude implementation. Major refactoring and optimizations are expected. The point is to prove that Snapshot Defrag can bring space savings.
This also includes dev commits that begin with `[dev]` in their commit messages; these likely won't end up in the code base. Commits marked `[split]` are the ones that can be put in separate PRs.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13003
How was this patch tested?
`TestSnapshotDefragService2` for faster debugging iteration.
Results
I've tested it with a "1 million keys overwrite" scenario locally in Docker dev:
After snapshot defrag processed all 3 snapshot checkpoint DBs, space usage went from 169 MB to 92 MB (46% space saved). This is actual disk usage (hard links are counted only once).
Note that the RocksDB WAL was disabled for more accurate disk usage measurement during testing; WALs can be huge when writing millions of keys.
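The reported saving follows directly from the two measurements (169 MB before, 92 MB after):

$$
\frac{169 - 92}{169} = \frac{77}{169} \approx 0.456 \approx 46\%
$$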
Before Defrag, all 3 snapshot DBs:
After Defrag, all 3 snapshot DBs:
There are more scenarios to be tried out. Even the above scenario can be improved by triggering DB compaction after EACH snapshot is taken. (DELETED_TABLE and DELETED_DIR_TABLE should also be copied over in the implementation.)