@smengcl (Contributor) commented Aug 19, 2025

WARNING: DO NOT MERGE. This PR is for visibility and comments only. It is too large by itself, and has to be broken down into multiple PRs.

What changes were proposed in this pull request?

Implement Snapshot Defrag service and manual trigger CLI. Design doc: #8514

This working POC contains an extremely crude implementation. Major refactoring and optimizations are expected. The point is to prove that Snapshot Defrag can save space.

This also includes dev commits whose commit messages begin with [dev]; these likely won't end up in the code base. Commits marked [split] are the ones that can be put in separate PRs.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13003

How was this patch tested?

  • Use integration test TestSnapshotDefragService2 for faster debugging iteration.
  • Use Docker dev to manually test the defrag service (which generates keys an order of magnitude faster).

Results

I've tested it with "1 million keys overwrite" scenario locally in Docker dev:

  1. write 1m keys, take snapshot 1
  2. overwrite those 1m keys, take snapshot 2
  3. overwrite those 1m keys again, take snapshot 3

After snapshot defrag processed all 3 snapshot checkpoint DBs, the space usage went from 169 MB to 92 MB (46% space saved). This is actual disk usage (hard links are only counted once).

Note: the RocksDB WAL is disabled for more accurate disk usage numbers during testing. The WALs can be huge when writing millions of keys.

Before Defrag, all 3 snapshot DBs:

▶ du -h -d1 ./checkpointState
 43M	./checkpointState/om.db-6639d124-6615-4ced-9af6-3dabd680727b
 63M	./checkpointState/om.db-d39279ce-cab6-44e0-839a-2baecb8c283a
 62M	./checkpointState/om.db-77b75627-5534-4db4-88e5-1661aceae92f
169M	./checkpointState

After Defrag, all 3 snapshot DBs:

▶ du -h -d1 ./checkpointStateDefragged
 83M	./checkpointStateDefragged/om.db-6639d124-6615-4ced-9af6-3dabd680727b
4.1M	./checkpointStateDefragged/om.db-d39279ce-cab6-44e0-839a-2baecb8c283a
4.1M	./checkpointStateDefragged/om.db-77b75627-5534-4db4-88e5-1661aceae92f
 92M	./checkpointStateDefragged
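The "hard links counted once" measurement and the 46% figure can be reproduced outside of `du` with a small JDK-only sketch. This is not code from the PR: the class and method names below are hypothetical, and it sums logical file sizes rather than allocated blocks, so its totals may differ slightly from what `du` reports.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Ozone.
final class HardLinkAwareDiskUsage {

  /**
   * Sums the sizes of regular files under root, counting each underlying
   * file only once even if several hard links point to it.
   */
  static long uniqueBytes(Path root) throws IOException {
    final Set<Object> seen = new HashSet<>();
    final long[] total = {0L};
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
        if (attrs.isRegularFile()) {
          // fileKey() identifies the underlying file (device + inode on
          // Unix-like systems). It may be null on some platforms, in which
          // case the file is simply counted every time it is seen.
          Object key = attrs.fileKey();
          if (key == null || seen.add(key)) {
            total[0] += attrs.size();
          }
        }
        return FileVisitResult.CONTINUE;
      }
    });
    return total[0];
  }

  /** Percentage of space saved going from before to after. */
  static double savedPercent(long before, long after) {
    return 100.0 * (before - after) / before;
  }
}
```

For example, `HardLinkAwareDiskUsage.uniqueBytes(Paths.get("./checkpointStateDefragged"))` would walk the defragged checkpoint directory, and `savedPercent(169, 92)` gives roughly 45.6, which rounds to the 46% quoted above.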

There are more scenarios to be tried out. Even the above scenario can be improved by triggering DB compaction after EACH snapshot is taken. (And DELETED_TABLE / DELETED_DIR_TABLE should also be copied over in the impl.)

smengcl added 30 commits August 18, 2025 21:04
…o anything else for now. need to handle upgrade later. e.g. block defrag service until finalization is performed)
…efrag.limit.per.task`, `ozone.snapshot.defrag.service.interval`.
Added writeOptions.setDisableWAL(true) to reduce disk writes during testing. This change is intended for test environments only and should not be used in production.
Removes native library includes from rocks-native/pom.xml and adds an antrun plugin in dist/pom.xml to copy native libraries to the Ozone distribution. Updates the shell script to include the new library path in JAVA_LIBRARY_PATH for relevant commands. This improves the packaging and loading of native libraries in Ozone.

Rocks tools native lib should not be inside jar apache#123 (-HDDS-11591. Copy dependencies when building each module)
Introduces a debug log message when System.loadLibrary fails, providing more visibility into library loading issues before attempting to load from the jar.
Changed the default values for lastDefragTime to -1L, needsDefragmentation to true, and version to 1 in the OmSnapshotLocalData constructor.
Moved OZONE_OM_SNAPSHOT_DIFF_REPORT_MAX_PAGE_SIZE_DEFAULT to immediately follow its related config key for better organization.
…ocal dev testing

Refactored docker-compose.yaml to move volume mounts from the common config to individual services, adding service-specific data volumes and profiles for optional services. Updated docker-config with new Ozone and RocksDB settings, including increased snapshot diff page size, WAL configuration, and snapshot defragmentation interval.
Replaces calls to defragService.start() with defragService.triggerSnapshotDefragOnce() in OzoneManager.
Modified TestSnapshotDefragService2 to use FSO bucket layout. Adjusted configuration parameters for defrag service interval, timeout, and diff report page size. Updated key creation logic for zero-byte keys.
Updated RandomKeyGenerator to catch and log OMException when volumes or buckets already exist, allowing the process to continue instead of failing. Also standardized naming for volumes, buckets, and keys to remove random suffixes.
…ificationTime change as MODIFY instead of RENAME for zero-byte keys
Changed the visibility of the RocksCheckpoint class and its related methods (createCheckpoint and get) from package-private to public. This allows external classes to access and utilize these functionalities.
…g a lot of warning in the new integration test)

Added support for the STAND_ALONE replication type in getReplicatedSize. For STAND_ALONE, the replicated size is set equal to the data size.
… others

Inserted a TODO comment clarifying that only OM requires the snapshot directory, while SCM, DN, etc. do not.
@smengcl smengcl added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Aug 19, 2025
@smengcl (Contributor Author) commented Aug 19, 2025

bugbot run

// Snapshot defragmentation service configuration
public static final String SNAPSHOT_DEFRAG_LIMIT_PER_TASK =
    "ozone.snapshot.defrag.limit.per.task";
public static final int SNAPSHOT_DEFRAG_LIMIT_PER_TASK_DEFAULT = 1;

TODO: this limit can be raised to 2 or 3 later
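As a rough illustration of what a per-task limit could mean, the sketch below assumes the setting caps how many snapshots a single service run picks up. The key name suggests this, but nothing in this thread confirms it. The `Snapshot` class and `selectForRun` method are hypothetical; only the `needsDefragmentation` flag name is borrowed from the `OmSnapshotLocalData` commit message above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real service may select snapshots differently.
final class DefragBatchSelector {

  /** Minimal stand-in for per-snapshot local metadata (assumed shape). */
  static final class Snapshot {
    final String name;
    final boolean needsDefragmentation;

    Snapshot(String name, boolean needsDefragmentation) {
      this.name = name;
      this.needsDefragmentation = needsDefragmentation;
    }
  }

  /**
   * Walks the snapshots in order and returns the names of the first
   * limitPerTask snapshots that still need defragmentation.
   */
  static List<String> selectForRun(List<Snapshot> snapshots, int limitPerTask) {
    List<String> picked = new ArrayList<>();
    for (Snapshot s : snapshots) {
      if (picked.size() >= limitPerTask) {
        break; // the per-task limit is reached; leave the rest for later runs
      }
      if (s.needsDefragmentation) {
        picked.add(s.name);
      }
    }
    return picked;
  }
}
```

With the default of 1, each run would handle a single snapshot under this assumption. Raising the limit to 2 or 3 would let one run cover more snapshots, at the cost of longer runs.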

public static final String OZONE_SNAPSHOT_DEFRAG_SERVICE_INTERVAL =
    "ozone.snapshot.defrag.service.interval";
public static final String
    OZONE_SNAPSHOT_DEFRAG_SERVICE_INTERVAL_DEFAULT = "60s";

TODO: May need to increase this interval if snapshotDiff takes a long time.

Does the interval countdown start only after the service run ends? I vaguely remember that being the case. Need to double-check.
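The question above corresponds to the difference between `scheduleWithFixedDelay` and `scheduleAtFixedRate` on the JDK's `ScheduledExecutorService`. Which of the two the Ozone background service actually uses is not confirmed in this thread. The sketch below only models the documented JDK semantics as plain arithmetic, assuming an initial delay of 0. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of JDK scheduling semantics; time units are arbitrary.
final class IntervalSemantics {

  /**
   * scheduleWithFixedDelay: the countdown starts when the previous run ENDS,
   * so each start is previous start + run time + delay.
   */
  static List<Long> fixedDelayStarts(long delay, long runTime, int runs) {
    List<Long> starts = new ArrayList<>();
    long start = 0;
    for (int i = 0; i < runs; i++) {
      starts.add(start);
      start = start + runTime + delay;
    }
    return starts;
  }

  /**
   * scheduleAtFixedRate: runs are due at 0, period, 2 * period, and so on.
   * A run that overshoots its period delays the next one, but runs never
   * execute concurrently.
   */
  static List<Long> fixedRateStarts(long period, long runTime, int runs) {
    List<Long> starts = new ArrayList<>();
    long previousEnd = 0;
    for (int i = 0; i < runs; i++) {
      long start = Math.max(i * period, previousEnd);
      starts.add(start);
      previousEnd = start + runTime;
    }
    return starts;
  }
}
```

With a 60s interval and a run that takes 90s, the fixed-delay model starts runs at 0, 150, and 300 seconds, while the fixed-rate model starts them back to back at 0, 90, and 180 seconds. If the service follows the fixed-delay behaviour, a slow snapshotDiff stretches the cycle rather than causing runs to pile up.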

@smengcl (Contributor Author) commented Nov 3, 2025

Closing this one, as the proper implementation is already being done in other PRs.

@smengcl smengcl closed this Nov 3, 2025