HDDS-13003. [POC] Snapshot Defragmentation to reduce storage footprint #8954

smengcl · 2025-08-19T04:35:18Z

WARNING: DO NOT MERGE. This PR is for visibility and comments only. It is too large by itself, and has to be broken down into multiple PRs.

What changes were proposed in this pull request?

Implement Snapshot Defrag service and manual trigger CLI. Design doc: #8514

This working POC contains an extremely crude implementation. Major refactoring and optimizations are expected. The point is to prove that Snapshot Defrag can bring space savings.

This PR also includes dev commits, marked with [dev] at the start of their commit messages, that likely won't end up in the code base. Commits marked [split] are the ones that can be put in separate PRs.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-13003

How was this patch tested?

Use integration test TestSnapshotDefragService2 for faster debugging iteration.
Use Docker dev to manually test the defrag service (which generates keys an order of magnitude faster).

Results

I've tested it with a "1 million keys overwrite" scenario locally in Docker dev:

write 1m keys, take snapshot 1
overwrite those 1m keys, take snapshot 2
overwrite those 1m keys again, take snapshot 3

After snapshot defrag processed all 3 snapshot checkpoint DBs, space usage went from 169 MB to 92 MB (about 46% saved). This is actual disk usage: hard links are counted only once.

Note that the RocksDB WAL is disabled during testing for more accurate disk usage numbers; WALs can be huge when writing millions of keys.

Before Defrag, all 3 snapshot DBs:

▶ du -h -d1 ./checkpointState
 43M	./checkpointState/om.db-6639d124-6615-4ced-9af6-3dabd680727b
 63M	./checkpointState/om.db-d39279ce-cab6-44e0-839a-2baecb8c283a
 62M	./checkpointState/om.db-77b75627-5534-4db4-88e5-1661aceae92f
169M	./checkpointState

After Defrag, all 3 snapshot DBs:

▶ du -h -d1 ./checkpointStateDefragged
 83M	./checkpointStateDefragged/om.db-6639d124-6615-4ced-9af6-3dabd680727b
4.1M	./checkpointStateDefragged/om.db-d39279ce-cab6-44e0-839a-2baecb8c283a
4.1M	./checkpointStateDefragged/om.db-77b75627-5534-4db4-88e5-1661aceae92f
 92M	./checkpointStateDefragged
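The figures above count hard-linked files once, which is what `du` does by default. A minimal Java sketch of the same accounting follows, assuming a POSIX file system where `BasicFileAttributes.fileKey()` identifies the underlying inode (it may return null on other platforms, in which case the sketch counts every file). The class and method names are illustrative and not part of this PR. It sums logical file sizes rather than allocated blocks, so totals can differ slightly from `du`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class HardLinkAwareUsage {

  /** Sums the sizes of regular files under root, counting each inode once. */
  static long uniqueBytes(Path root) throws IOException {
    List<Path> files;
    try (Stream<Path> walk = Files.walk(root)) {
      files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
    }
    Set<Object> seen = new HashSet<>();
    long total = 0;
    for (Path file : files) {
      BasicFileAttributes attrs =
          Files.readAttributes(file, BasicFileAttributes.class);
      Object key = attrs.fileKey();
      // fileKey() may be null on some platforms; then every file is counted.
      if (key == null || seen.add(key)) {
        total += attrs.size();
      }
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("defrag-usage");
    Path sst = Files.write(root.resolve("000001.sst"), new byte[4096]);
    // Stand-in for a checkpoint that hard-links an SST file instead of copying it.
    Files.createLink(root.resolve("000001-link.sst"), sst);
    System.out.println(uniqueBytes(root) + " bytes");
  }
}
```

Pointing `uniqueBytes` at a parent directory that holds several checkpoint DBs gives one de-duplicated total, which is the number that matters when comparing before and after defrag.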

There are more scenarios to try out. Even the above scenario can be improved by triggering DB compaction after EACH snapshot is taken. (DELETED_TABLE / DELETED_DIR_TABLE should also be copied over in the implementation.)

Removes native library includes from rocks-native/pom.xml and adds an antrun plugin in dist/pom.xml to copy native libraries to the Ozone distribution. Updates the shell script to include the new library path in JAVA_LIBRARY_PATH for relevant commands. This improves the packaging and loading of native libraries in Ozone. Rocks tools native lib should not be inside jar apache#123 (-HDDS-11591. Copy dependencies when building each module)

…ocal dev testing Refactored docker-compose.yaml to move volume mounts from the common config to individual services, adding service-specific data volumes and profiles for optional services. Updated docker-config with new Ozone and RocksDB settings, including increased snapshot diff page size, WAL configuration, and snapshot defragmentation interval.

Updated RandomKeyGenerator to catch and log OMException when volumes or buckets already exist, allowing the process to continue instead of failing. Also standardized naming for volumes, buckets, and keys to remove random suffixes.

smengcl · 2025-08-19T21:55:58Z

bugbot run

smengcl · 2025-08-22T22:15:57Z

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/OMConfigKeys.java

+  // Snapshot defragmentation service configuration
+  public static final String SNAPSHOT_DEFRAG_LIMIT_PER_TASK =
+      "ozone.snapshot.defrag.limit.per.task";
+  public static final int SNAPSHOT_DEFRAG_LIMIT_PER_TASK_DEFAULT = 1;


TODO: the limit can be set to 2 or 3 later

smengcl · 2025-08-22T22:19:13Z

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/OMConfigKeys.java

+  public static final String OZONE_SNAPSHOT_DEFRAG_SERVICE_INTERVAL =
+      "ozone.snapshot.defrag.service.interval";
+  public static final String
+      OZONE_SNAPSHOT_DEFRAG_SERVICE_INTERVAL_DEFAULT = "60s";


TODO: May need to increase this interval if snapshotDiff takes a long time.

Does the interval countdown start only after the service run ends? I vaguely remember that being the case. Need to double-check.
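On the open question above: in plain `java.util.concurrent`, the answer depends on which scheduling method is used. `scheduleWithFixedDelay` starts the countdown when a run ends, while `scheduleAtFixedRate` measures from the start of each run. The sketch below only demonstrates the fixed-delay behavior with a stand-in task. Which method the defrag service's scheduler actually uses is not stated in this PR and still needs to be checked. All names are illustrative.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class FixedDelayDemo {

  /**
   * Runs a stand-in task `runs` times with scheduleWithFixedDelay and returns
   * the smallest gap (ms) between one run ending and the next run starting.
   */
  static long minGapMillis(long taskMillis, long delayMillis, int runs)
      throws InterruptedException {
    if (runs < 2) {
      throw new IllegalArgumentException("need at least 2 runs to measure a gap");
    }
    long[] starts = new long[runs];
    long[] ends = new long[runs];
    int[] next = {0};  // only touched by the single scheduler thread
    CountDownLatch done = new CountDownLatch(runs);
    ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
    pool.scheduleWithFixedDelay(() -> {
      int i = next[0];
      if (i >= runs) {
        return;  // ignore any run scheduled after the measurement is complete
      }
      starts[i] = System.nanoTime();
      try {
        Thread.sleep(taskMillis);  // stand-in for the actual service work
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      ends[i] = System.nanoTime();
      next[0] = i + 1;
      done.countDown();
    }, 0, delayMillis, TimeUnit.MILLISECONDS);
    done.await();
    pool.shutdownNow();

    long minGapNanos = Long.MAX_VALUE;
    for (int i = 1; i < runs; i++) {
      minGapNanos = Math.min(minGapNanos, starts[i] - ends[i - 1]);
    }
    return TimeUnit.NANOSECONDS.toMillis(minGapNanos);
  }

  public static void main(String[] args) throws InterruptedException {
    // The task takes longer than the delay, yet the gap never drops below the delay.
    System.out.println("min gap (ms): " + minGapMillis(100, 50, 3));
  }
}
```

With a fixed delay, a slow snapshotDiff lengthens the overall cycle but never causes back-to-back runs, so the interval acts as a guaranteed idle period between runs.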

smengcl · 2025-11-03T19:55:28Z

Closing this one, as the proper implementation is already being done in other PRs.

smengcl added 30 commits August 18, 2025 21:04

HDDS-13003. [POC] Snapshot Defragmentation to reduce storage footprint

8f1b164

[split] Add new OM layout feature SNAPSHOT_DEFRAGMENTATION (doesn't do anything else for now. need to handle upgrade later. e.g. block defrag service until finalization is performed)

ad3ed61

Make RDBSstFileWriter public to be used in defrag service.

f81e5b7

Implement delete(key) in RDBSstFileWriter.

e15a62d

Add config `ozone.snapshot.defrag.service.timeout`, `ozone.snapshot.defrag.limit.per.task`, `ozone.snapshot.defrag.service.interval`.

c04d67a

Rename term: compact -> defrag

160434b

[split] HDDS-13009. Implement SnapshotDefragService POC.

ac58394

[split] Implement manual trigger of snapshot defrag. e.g. `ozone admin om triggerSnapshotDefrag -host=om`

3de45ba

Add tests (for debugging)

ab9d67e

[split] yaml: TODO: Add OM_SLD_DB_CHECKPOINT_DIR field

5bf6646

[drop][dev] Disable WAL archival

d9ace6b

SDS: Write diff to defragged DB directly instead of writing to SST file with RDBSstFileWriter then ingest.

1270359

[dev] Disable WAL in write options for testing

410ba6e

Added writeOptions.setDisableWAL(true) to reduce disk writes during testing. This change is intended for test environments only and should not be used in production.

[dev] Add debug log for failed System.loadLibrary call

41abbc1

Introduces a debug log message when System.loadLibrary fails, providing more visibility into library loading issues before attempting to load from the jar.
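The load order this commit describes can be sketched as follows: try `System.loadLibrary` first, log the failure at debug level, then fall back. This sketch uses `java.util.logging` and a hypothetical `fallback` callback standing in for the load-from-jar path. The actual loader class, logger, and log message in the PR are not shown here and may differ.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class NativeLoadSketch {
  private static final Logger LOG =
      Logger.getLogger(NativeLoadSketch.class.getName());

  /**
   * Tries java.library.path first. On failure, logs at debug (FINE) level and
   * runs the fallback. Returns true only if System.loadLibrary succeeded.
   */
  static boolean loadWithFallback(String libraryName, Runnable fallback) {
    try {
      System.loadLibrary(libraryName);
      return true;
    } catch (UnsatisfiedLinkError e) {
      // Surfacing the cause here makes library path problems easier to diagnose.
      LOG.log(Level.FINE,
          "System.loadLibrary(" + libraryName + ") failed, falling back", e);
      fallback.run();  // hypothetical stand-in for extracting the library from the jar
      return false;
    }
  }

  public static void main(String[] args) {
    boolean direct = loadWithFallback("no_such_lib_for_demo",
        () -> System.out.println("fallback: would load the library from the jar"));
    System.out.println("loaded from library path: " + direct);
  }
}
```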

Update default values in OmSnapshotLocalData constructor

61a7fac

Changed the default values for lastDefragTime to -1L, needsDefragmentation to true, and version to 1 in the OmSnapshotLocalData constructor.

[dev] testUpdateYaml to set fields in .yaml before triggering defrag

fb700d0

Remove TestSnapshotDefragService test class

2f1cb67

Reorder constant definition in OMConfigKeys

737416b

Moved OZONE_OM_SNAPSHOT_DIFF_REPORT_MAX_PAGE_SIZE_DEFAULT to immediately follow its related config key for better organization.

om: Use triggerSnapshotDefragOnce for snapshot defrag manual triggering

173ea8e

Replaces calls to defragService.start() with defragService.triggerSnapshotDefragOnce() in OzoneManager.

SDS: Do not fall back to full defrag on inc defrag error

f1e5def

SDS: Clean up, add TODOs

8124405

SDS: Fix extractKeyFromPath for FSO (hack).

026f74f

Rename CHECKPOINT_STATE_DEFRAGED_DIR -> CHECKPOINT_STATE_DEFRAGGED_DIR

af7dc32

Update TestSnapshotDefragService2 for FSO buckets and config tuning

f08ebe9

Modified TestSnapshotDefragService2 to use FSO bucket layout. Adjusted configuration parameters for defrag service interval, timeout, and diff report page size. Updated key creation logic for zero-byte keys.

[dev] Update key comparison logic in SnapshotDiffManager to count modificationTime change as MODIFY instead of RENAME for zero-byte keys

420323f

Make RocksCheckpoint and related methods public

bcab08c

Changed the visibility of the RocksCheckpoint class and its related methods (createCheckpoint and get) from package-private to public. This allows external classes to access and utilize these functionalities.

[split] Handle STAND_ALONE replication type in QuotaUtil (was printing a lot of warning in the new integration test)

60fbe52

Added support for the STAND_ALONE replication type in getReplicatedSize. For STAND_ALONE, the replicated size is set equal to the data size.
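A minimal sketch of the rule described in this commit, using a hypothetical enum and method rather than the real `QuotaUtil` signature. The RATIS branch (data size times replication factor) is an assumption included only for contrast. EC and any other types are left out.

```java
final class ReplicatedSizeSketch {

  /** Hypothetical, reduced set of replication types for illustration. */
  enum ReplicationType { RATIS, STAND_ALONE }

  static long replicatedSize(long dataSize, ReplicationType type, int factor) {
    switch (type) {
    case RATIS:
      // Assumption: one full copy of the data per replica.
      return dataSize * factor;
    case STAND_ALONE:
      // From the commit message: replicated size equals the data size.
      return dataSize;
    default:
      throw new IllegalArgumentException("Unhandled replication type: " + type);
    }
  }

  public static void main(String[] args) {
    System.out.println(replicatedSize(1024, ReplicationType.STAND_ALONE, 1));
    System.out.println(replicatedSize(1024, ReplicationType.RATIS, 3));
  }
}
```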

smengcl added 2 commits August 18, 2025 21:17

[split] TODO that we should disable db.snapshots creation for scm and others

fb5fa37

Inserted a TODO comment clarifying that only OM requires the snapshot directory, while SCM, DN, etc. do not.

Remove TestOmSnapshotDefrag integration test

09a381a

smengcl added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Aug 19, 2025

Update docker-compose.yaml

a527b67

smengcl commented Oct 6, 2025

smengcl closed this Nov 3, 2025
