HDDS-13034. Refactor Directory Deleting Service to use ReclaimableDirectoryFilter & ReclaimableKeyFilter #573
jojochuang wants to merge 21 commits into master
Conversation
…s and deleted subFiles Change-Id: Ic1fc709b3963cde14c2a7fb64b687322a29e642a
Change-Id: I47b24dfc3b5afa3cefbdc85ac7b3e4a9b8c94869
…ectoryFilter & ReclaimableKeyFilter Change-Id: Iffdda403cba914ddef2b75808cfbef1a72b9a2d3
Change-Id: I11acc3782aadf8393f731adcaa2a436dd9b534ae
…kets and volumes have already been deleted Change-Id: I16dc9d8f00686320b4e98fa5691420294a7f1e2f
Change-Id: Ie5fd1406bbb8af3a9ba76440dcba9b8d8db14691
… pipeline with different node (apache#8549)
Change-Id: I61ef68263ff88daa0e53dfb9d7d8eb62495d226b
Change-Id: I2667c6d12523f4dee7cbcf7c48c93803fe84d3d4
Change-Id: I5ed93af3b5ae794b0cfe4671ec2a851592edcb8c
…s and deleted subFiles (apache#8538)
…kets and volumes have already been deleted (apache#8551)
Change-Id: I4c234b75208d96a146c7498bb6c2c188a270e1b8
# Conflicts:
#	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java
#	hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/filter/AbstractReclaimableFilterTest.java
Change-Id: Iac3af98a7e568a135073b6704a6ad5a5fac7b427
Change-Id: I9b7b41cf667e03d48120a4201757e445227924f7
Change-Id: I2ff1cf3ecf3baa00a5c5646901f6c9ffdbe6e370
Hello @jojochuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! gemini-code-assist here, providing a summary of this pull request.
The primary goal of this PR, as stated in the title and description, is to refactor the Directory Deleting Service (DDS) to unify the deep cleaning process for both snapshots and the Active Object Store (AOS). This unification is achieved by leveraging the ReclaimableDirectoryFilter and ReclaimableKeyFilter classes. The PR outlines an algorithm where the DDS iterates through deleted directory entries, initializes a thread-safe iterator, and submits purge requests. The logic for purging directories and files depends on whether they are still referenced by a previous snapshot. If a directory is referenced, only its subdirectories are expanded for future iterations, and reclaimable sub-files are moved to deleted space. If not referenced, all contents are removed. The process also includes updating snapshot exclusive size and deep clean flags once processing is complete for a store (AOS or snapshot).
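The per-directory decision described above can be pictured with a small toy model. This is an illustrative sketch only — the names below are hypothetical, not the real Ozone API, and the actual filters (`ReclaimableDirectoryFilter`, `ReclaimableKeyFilter`) consult the previous snapshot in the snapshot chain rather than a plain set:

```java
import java.util.Set;

/** Toy model of the reclaim decision: an entry in the deleted table is
 *  reclaimable only when no previous snapshot still references it. A null
 *  set models "no previous snapshot in the chain" (e.g. plain AOS cleanup). */
public final class ReclaimCheckSketch {
    public static boolean isReclaimable(String objectKey, Set<String> previousSnapshotKeys) {
        return previousSnapshotKeys == null || !previousSnapshotKeys.contains(objectKey);
    }

    /** Mirrors the branch in the algorithm: purge the whole subtree only when
     *  the directory itself is no longer referenced by the previous snapshot. */
    public static String purgePlan(String dirKey, Set<String> previousSnapshotKeys) {
        return isReclaimable(dirKey, previousSnapshotKeys)
            ? "purge directory and all sub-paths"
            : "expand sub-directories; move only reclaimable sub-files to deleted table";
    }

    private ReclaimCheckSketch() { }
}
```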
Highlights
- Directory Deleting Service Refactor: The Directory Deleting Service has been significantly refactored to use `ReclaimableDirectoryFilter` and `ReclaimableKeyFilter`. This allows the same service to handle deep cleaning for both the Active Object Store and individual snapshots, unifying the logic and simplifying the implementation.
- Snapshot Deep Cleaning Unification: The logic for deep cleaning directories within snapshots is now integrated into the main Directory Deleting Service, replacing the separate `SnapshotDirectoryCleaningService`. This streamlines the background deletion processes.
- Deprecated Configurations: Configuration keys specific to the old `SnapshotDirectoryCleaningService` (`ozone.snapshot.directory.service.interval` and `ozone.snapshot.directory.service.timeout`) have been marked as deprecated in `OMConfigKeys.java` and `ozone-default.xml`.
- Container Balancer Validation: A new validation has been added to the Container Balancer configuration to ensure that the difference between `hdds.container.balancer.move.timeout`, `hdds.container.balancer.move.replication.timeout`, and `hdds.scm.replication.event.timeout.datanode.offset` is at least 9 minutes.
- Debug Tool Enhancements: The `ozone debug replicas verify` command now includes an option (`--container-state`) to check the state of containers and their replicas, identifying those in unhealthy or deleted states. A cache size option (`--container-cache-size`) was also added for this check.
- Freon Bucket Layout Option: The Freon `RandomKeyGenerator` tool now supports specifying the bucket layout (`--bucket-layout`) when creating buckets, allowing for testing with different bucket types like `FILE_SYSTEM_OPTIMIZED`.
Changelog
- hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConfigKeys.java
  - Updated the default value for `fs.trash.classname` from `org.apache.hadoop.ozone.om.TrashPolicyOzone` to `org.apache.hadoop.fs.ozone.OzoneTrashPolicy` (line 624).
- hadoop-hdds/common/src/main/resources/ozone-default.xml
  - Updated the default value for the `fs.trash.classname` property to `org.apache.hadoop.fs.ozone.OzoneTrashPolicy` (line 2792).
  - Added `DEPRECATED` tag to the `ozone.snapshot.directory.service.timeout` property description (line 3789).
  - Added `DEPRECATED` tag and prefix to the description of the `ozone.snapshot.directory.service.interval` property (lines 3798-3800).
- hadoop-hdds/docs/content/interface/Ofs.md
  - Updated the `fs.trash.classname` value in the documentation example to `org.apache.hadoop.fs.ozone.OzoneTrashPolicy` (line 240).
- hadoop-hdds/docs/content/interface/Ofs.zh.md
  - Updated the `fs.trash.classname` value in the Chinese documentation example to `org.apache.hadoop.fs.ozone.OzoneTrashPolicy` (line 223).
- hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java
  - Added imports for `Duration` and `TimeUnit` (lines 22, 24).
  - Added a new validation check in `validateConfiguration` to ensure sufficient time difference between balancer timeouts (lines 465-480).
- hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestContainerBalancer.java
  - Added a new test case to verify the validation of balancer timeout configurations, specifically the 9-minute minimum difference (lines 210-226).
- hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/OMConfigKeys.java
  - Marked the `OZONE_SNAPSHOT_DIRECTORY_SERVICE_INTERVAL` and `OZONE_SNAPSHOT_DIRECTORY_SERVICE_TIMEOUT` constants as `@Deprecated` (lines 390, 393, 396, 399).
- hadoop-ozone/dist/src/main/smoketest/debug/ozone-debug-tests.robot
  - Removed unused variable `${OM_SERVICE_ID}` (line 30).
  - Added `--container-state` and `--block-existence` options to the `ozone debug replicas verify` command in the test case (line 40).
- hadoop-ozone/freon/src/main/java/org/apache/hadoop/ozone/freon/RandomKeyGenerator.java
  - Added imports for `BucketArgs` and `BucketLayout` (lines 63, 71).
  - Added a new command-line option `--bucket-layout` to specify the bucket layout (lines 196-200).
  - Added a `@VisibleForTesting` method `getBucketMapSize` to return the size of the buckets map (lines 657-665).
  - Modified the `createBucket` method to use `BucketArgs` if `--bucket-layout` is specified (lines 778-785).
  - Added the `@VisibleForTesting` annotation to the `getBucket` method (line 908).
- hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/fs/ozone/AbstractOzoneFileSystemTest.java
  - Removed import for `TrashPolicyOzone` (line 114).
  - Updated assertion in `testTrash` to check for `OzoneTrashPolicy.class` instead of `TrashPolicyOzone.class` (line 1687).
- hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/fs/ozone/AbstractRootedOzoneFileSystemTest.java
  - Removed import for `TrashPolicyOzone` (line 120).
  - Updated assertion in `testTrash` to check for `OzoneTrashPolicy.class` instead of `TrashPolicyOzone.class` (line 1871).
- hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/TestContainerBalancerOperations.java
  - Adjusted the default value for `moveReplicationTimeout` in test cases (lines 85, 150).
- hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/TestDirectoryDeletingServiceWithFSO.java
  - Added imports for `ReclaimableDirFilter` and `ReclaimableKeyFilter` (lines 82, 83).
  - Replaced `assertThat(elapsedRunCount).isGreaterThanOrEqualTo(7)` with `GenericTestUtils.waitFor` for more robust waiting in `testDeleteWithLargeSubPathsThanBatchSize` (line 282).
  - Updated the constructor call for `DirectoryDeletingService` in `testAOSKeyDeletingWithSnapshotCreateParallelExecution` to include the `deepCleanSnapshots` parameter (line 595).
  - Updated the mocked `optimizeDirDeletesAndSubmitRequest` method call to include the new filter parameters (lines 629-630).
  - Changed the mocked method call from `getSnapshot` to `getActiveSnapshot` in `testAOSKeyDeletingWithSnapshotCreateParallelExecution` (line 654).
  - Removed a blank line (line 732).
  - Corrected assertions in `testDirDeletedTableCleanUpForSnapshot` for the `dirTable` count (from 4 to 0) and `deletedDirTable` count (from 1 to 5) after directory deletion (lines 757, 765).
  - Corrected assertion for `getMovedDirsCount` in `testDirDeletedTableCleanUpForSnapshot` (from 0 to 4) (line 773).
  - Updated the constructor call for `DirectoryDeletingService` in `getMockedDirectoryDeletingService` to include the `deepCleanSnapshots` parameter (line 530).
  - Added mocking for `KeyManager` and its `getDeletedDirEntries` method in `getMockedDirectoryDeletingService` (lines 525-526, 539).
- hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/TestSnapshotDeletingServiceIntegrationTest.java
  - Added import for `IOException` (line 27).
  - Adjusted the timeout for `OZONE_SNAPSHOT_DELETING_SERVICE_TIMEOUT` from 10000ms to 10ms and removed the setting for `OZONE_SNAPSHOT_DIRECTORY_SERVICE_INTERVAL` (lines 130-131).
  - Added assertions for initial table counts and suspended KDS/DDS before creating keys (lines 259-264).
  - Added resume calls for KDS/DDS and waits for deep cleaning flags to be set after creating snapshots (lines 388-401, 444-457).
  - Added assertions for table counts after deep cleaning and before deleting snap2 (lines 408-411).
  - Added resume calls for KDS/DDS after deleting snap3 and updated assertions for deleted/renamed table counts (lines 472-478).
  - Added mocking for `KeyManager` and its `getDeletedDirEntries` method in `getMockedDirectoryDeletingService` (lines 525-526, 539).
  - Updated the constructor call for `DirectoryDeletingService` in `getMockedDirectoryDeletingService` to include the `deepCleanSnapshots` parameter (line 530).
- hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotDirectoryCleaningService.java
  - Added import for `IOException` (line 27).
  - Added imports for `OmMetadataManagerImpl` and `SnapshotChainManager` (lines 48, 49).
  - Replaced import for `SnapshotDirectoryCleaningService` with `DirectoryDeletingService` (line 55).
  - Changed the configuration key used for the interval from `OZONE_SNAPSHOT_DIRECTORY_SERVICE_INTERVAL` to `OZONE_DIR_DELETING_SERVICE_INTERVAL` (line 82).
  - Replaced `SnapshotDirectoryCleaningService` with `DirectoryDeletingService` and added `SnapshotChainManager` in the test setup (lines 146-149).
  - Replaced references to `snapshotDirectoryCleaningService` with `directoryDeletingService` and adjusted the wait condition for run count (lines 228-230).
  - Added a wait condition within the snapshot iteration to ensure the next snapshot in the chain is deep cleaned before checking exclusive size (lines 247-255).
- hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/shell/TestOzoneDebugShell.java
  - Added the `--container-state` option to the `ozone debug replicas verify` command in the test case (line 106).
- hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/shell/TestOzoneShellHA.java
  - Added import for `OzoneTrashPolicy` (line 72).
  - Removed import for `TrashPolicyOzone` (line 95).
  - Updated comments and code to refer to `OzoneTrashPolicy` instead of `TrashPolicyOzone` (lines 994, 998, 1004, 1189).
- hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManager.java
  - Removed import for `SnapshotDirectoryCleaningService` (line 44).
  - Added a default method `getDeletedDirEntries()` that calls the overloaded version with nulls (lines 276-280).
  - Modified the signature of `getPendingDeletionSubDirs` to accept a `filter` parameter (lines 309-310).
  - Modified the signature of `getPendingDeletionSubFiles` to accept a `filter` parameter (lines 321-322).
  - Removed the `getSnapshotDirectoryService` method (lines 355-360).
- hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java
  - Removed imports for deprecated snapshot directory service config keys (lines 61-64).
  - Added import for `Table.KeyValue` (line 130).
  - Added import for `WithParentObjectId` (line 160).
  - Removed import for `SnapshotDirectoryCleaningService` (line 169).
  - Removed the `snapshotDirectoryCleaningService` field (line 215).
  - Updated the constructor call for `DirectoryDeletingService` to include the `isSnapshotDeepCleaningEnabled` parameter (line 289).
  - Removed the logic for starting the `snapshotDirectoryCleaningService` (lines 351-365).
  - Removed the logic for shutting down the `snapshotDirectoryCleaningService` (lines 444-447).
  - Updated the signature of `getPendingDeletionKeys` to use `KeyValue` instead of `Table.KeyValue` (lines 713, 721).
  - Updated the type of the iterator in `getPendingDeletionKeys` to use `KeyValue` (line 727).
  - Updated the type of the `kv` variable in `getPendingDeletionKeys` to use `KeyValue` (line 739).
  - Updated the signature of `getTableEntries` to use `KeyValue` instead of `Table.KeyValue` (lines 774, 777, 779).
  - Updated the type of the `kv` variable in `getTableEntries` to use `KeyValue` (line 790).
  - Updated the signature of `getRenamesKeyEntries` to use `KeyValue` instead of `Table.KeyValue` (lines 812, 814).
  - Updated the type of the iterator in `getRenamesKeyEntries` to use `KeyValue` (line 816).
  - Updated the signature of `getDeletedKeyEntries` to use `KeyValue` instead of `Table.KeyValue` (lines 861, 863).
  - Updated the type of the iterator in `getDeletedKeyEntries` to use `KeyValue` (line 866).
  - Removed the `getSnapshotDirectoryService` method (lines 956-959).
  - Updated the type of the iterator in `createFakeDirIfShould` to use `KeyValue` (line 1511).
  - Updated the type of the `keyValue` variable in `createFakeDirIfShould` to use `KeyValue` (line 1514).
  - Updated the type of the iterator in `listStatus` to use `KeyValue` (line 1825).
  - Updated the signature of `getIteratorForKeyInTableCache` to use `KeyValue` instead of `Table.KeyValue` (lines 1882, 1887).
  - Updated the signature of `findKeyInDbWithIterator` to use `KeyValue` instead of `Table.KeyValue` (lines 1905, 1910).
  - Updated the signature of `getDeletedDirEntries` to use `KeyValue` instead of `Table.KeyValue` (line 2161).
  - Refactored `getPendingDeletionSubDirs` and `getPendingDeletionSubFiles` to use a new private helper method `gatherSubPathsWithIterator` (lines 2168-2229).
  - Added the private helper method `gatherSubPathsWithIterator` to handle iterating and filtering sub-paths (lines 2177-2213).
- hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMDBCheckpointServlet.java
  - Included `om.getKeyManager().getDirDeletingService()` and removed `om.getKeyManager().getSnapshotDirectoryService()` from the list of services to lock during checkpointing (lines 707, 709).
- hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/TrashPolicyOzone.java
  - Changed the class visibility from `public` to package-private (`class TrashPolicyOzone`) (line 51).
  - Removed the default constructor `TrashPolicyOzone()` (lines 65-66).
- hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/file/OMFileRequest.java
  - Added static helper methods `getKeyInfoWithFullPath` to create `OmKeyInfo` with full path from `OmDirectoryInfo` or an existing `OmKeyInfo` (lines 723-737).
- hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/AbstractKeyDeletingService.java
  - Added imports for `Optional` and `StringUtils` (lines 31, 37).
  - Added imports for `OMResponse` and `CheckedFunction` (lines 63, 70).
  - Removed `InterruptedException` from the throws clause of `processKeyDeletes` (line 110).
  - Removed `InterruptedException` from the throws clause of `submitPurgeKeysRequest` (line 150).
  - Updated the throws clause of `submitRequest` to include `InterruptedException` (line 245).
  - Changed the return type of `submitPurgePaths` from `void` to `OMResponse` and added bootstrap lock acquisition (lines 249, 274-279).
  - Modified the signature of `prepareDeleteDirRequest` to accept a `purgeDir` boolean and a `reclaimableFileFilter`, and to return `Optional<PurgePathRequest>` (lines 312-317).
  - Updated the call to `keyManager.getPendingDeletionSubDirs` in `prepareDeleteDirRequest` to pass a filter that always returns true (line 331).
  - Updated the call to `keyManager.getPendingDeletionSubFiles` in `prepareDeleteDirRequest` to pass a filter based on `purgeDir` or the `reclaimableFileFilter` (line 349).
  - Modified the logic in `prepareDeleteDirRequest` to return `Optional.empty()` if no sub-paths are found and the directory is not purged (lines 362-364).
  - Updated the signature of `optimizeDirDeletesAndSubmitRequest` to accept `reclaimableDirChecker` and `reclaimableFileChecker` filters (lines 376-377).
  - Modified the loop in `optimizeDirDeletesAndSubmitRequest` to use the new `prepareDeleteDirRequest` signature and handle the `Optional` return value (lines 388-405).
- hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/DirectoryDeletingService.java
  - Added imports for various utility and concurrency classes (`Maps`, `Closeable`, `Collection`, `Iterator`, `Map`, `CompletableFuture`, `ExecutionException`, `ExecutorService`, `LinkedBlockingDeque`, `ThreadPoolExecutor`, `Stream`, `StringUtils`) (lines 21-40).
  - Added imports for `KeyManager`, `SnapshotChainManager`, `IOzoneManagerLock`, `ReclaimableDirFilter`, `ReclaimableKeyFilter`, and `OzoneManagerProtocolProtos` (lines 51, 57, 60, 62, 63, 64).
  - Removed the `dirDeletingCorePoolSize`, `deletedDirSupplier`, and `taskCount` fields (lines 79, 84, 86).
  - Added the `snapshotChainManager`, `deepCleanSnapshots`, `deletionThreadPool`, and `numberOfParallelThreadsPerStore` fields (lines 98-101).
  - Updated the constructor to accept `deepCleanSnapshots` and initialize the thread pool (lines 105-114).
  - Removed the `getTaskCount` method (lines 118-120).
  - Modified `getTasks` to create one task for AOS (snapshotId=null) and one task for each snapshot if deep cleaning is enabled (lines 158-171).
  - Removed the override of the `shutdown` method (lines 177-179).
  - Refactored `DeletedDirSupplier` to be a static inner class implementing `Closeable` and taking the iterator in its constructor (lines 181-199).
  - Modified `DirDeletingTask` to accept a `snapshotId` in its constructor (line 206).
  - Added the `getSetSnapshotRequestUpdatingExclusiveSize` helper method (lines 216-225).
  - Added the `processDeletedDirsForStore` method to handle processing for a specific store (AOS or snapshot) using multiple threads (lines 233-289).
  - Added the `processDeletedDirectories` helper method to perform the core directory processing logic, including using filters and updating exclusive size maps (lines 291-375).
  - Removed the `previousSnapshotHasDir` method (lines 284-341 in old code).
  - Removed the `getPendingDeletedDirInfo` method (lines 344-347 in old code).
  - Updated the `call` method to use the new `processDeletedDirsForStore` method and handle snapshot-specific logic (lines 378-427).
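The `DeletedDirSupplier` refactor listed above (a static inner class implementing `Closeable` that takes the iterator in its constructor) can be sketched roughly as follows; the generic type and method names here are simplified stand-ins for the real Ozone classes:

```java
import java.io.Closeable;
import java.util.Iterator;

/** Illustrative sketch: a Closeable wrapper that hands out entries from one
 *  shared iterator to multiple worker threads, one entry at a time. */
final class DeletedDirSupplierSketch<T> implements Closeable {
    private final Iterator<T> iterator;

    DeletedDirSupplierSketch(Iterator<T> iterator) {
        this.iterator = iterator;
    }

    /** Thread-safe: only one worker advances the shared iterator at a time. */
    synchronized T get() {
        return iterator.hasNext() ? iterator.next() : null;
    }

    @Override
    public void close() {
        // The real implementation would close the underlying table iterator here.
    }
}
```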
- hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/KeyDeletingService.java
  - Removed `InterruptedException` from the throws clause of `processDeletedKeysForStore` (line 207).
  - Updated the log message in `call` for deep cleaned snapshots to include the snapshot name (lines 326-327).
  - Removed `InterruptedException` from the throws clause of `call` (line 348).
- hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/filter/ReclaimableFilter.java
  - Added import for `OMException` (line 36).
  - Modified `initializePreviousSnapshotsFromChain` to use `ozoneManager.getBucketManager().getBucketInfo` and handle `VOLUME_NOT_FOUND` or `BUCKET_NOT_FOUND` exceptions by setting `bucketInfo` and `volumeId` to null (lines 175-185).
  - Modified the `apply` method to return true immediately if `bucketInfo` is null (indicating the volume/bucket was deleted) (line 201).
- hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/filter/AbstractReclaimableFilterTest.java
  - Added imports for `BucketManager` and `OMException` (lines 51, 58).
  - Added parameterized tests `testReclaimableFilterSnapshotChainInitializationWithInvalidVolume` and `testReclaimableFilterSnapshotChainInitializationWithInvalidBucket` to test filter initialization with non-existent volumes/buckets (lines 133-161).
  - Modified `getLastSnapshotInfos` to use `getOrDefault` to handle cases where the bucket key might not exist in the map (line 246).
  - Modified `mockOzoneManager` to mock `BucketManager` and provide custom logic for `getBucketInfo` to simulate volume/bucket not found exceptions (lines 164-185).
- hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/filter/TestReclaimableDirFilter.java
  - Updated `testReclaimableDirFilter` to use `getOzoneManager().getBucketManager().getBucketInfo` instead of `getOzoneManager().getBucketInfo` (line 75).
- hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/filter/TestReclaimableKeyFilter.java
  - Updated `testReclaimableKeyFilter` to use `getOzoneManager().getBucketManager().getBucketInfo` instead of `getOzoneManager().getBucketInfo` (line 104).
- hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/filter/TestReclaimableRenameEntryFilter.java
  - Updated `testReclaimableRenameEntryFilter` to use `getOzoneManager().getBucketManager().getBucketInfo` instead of `getOzoneManager().getBucketInfo` (line 83).
- hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/replicas/BlockExistenceVerifier.java
  - Added `setId(datanode.getID())` when building the pipeline (line 59).
- hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/replicas/ContainerStateVerifier.java
  - Added a new file `ContainerStateVerifier.java` implementing `ReplicaVerifier` to check container and replica states.
- hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/replicas/ReplicasVerify.java
  - Added the `--container-cache-size` command-line option (lines 70-76).
  - Added logic to include `ContainerStateVerifier` in the list of verifiers if the `--container-state` option is used (lines 91-93).
  - Added a `doExecuteReplicaState` boolean field to the `Verification` inner class, controlled by the `--container-state` option (lines 236-241).
- hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/replicas/chunk/ChunkKeyHandler.java
  - Changed the JSON key from `ecIndex` to `replicaIndex` when reporting EC chunk information (line 199).
Pull Request Overview
This PR refactors the directory deletion logic to unify snapshot and AOS directory cleanup under the new DirectoryDeletingService using ReclaimableDirectoryFilter and ReclaimableKeyFilter, and removes the old SnapshotDirectoryCleaningService.
- Introduce `dirDeletingService` in OM bootstrap and remove `snapshotDirectoryCleaningService`.
- Update the `KeyManager` API and implementation to use generic `WithParentObjectId` and the `Table.KeyValue` alias.
- Adjust integration tests to use the new service, filters, and updated trash policy class.
Reviewed Changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMDBCheckpointServlet.java | Added getDirDeletingService() to lock list; removed old snapshot directory service |
| ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java | Replaced SnapshotDirectoryCleaningService with DirectoryDeletingService, updated iterator types to KeyValue |
| ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManager.java | Updated API signatures for directory/file sub-path retrieval with filter parameter |
| integration-test/.../TestSnapshotDirectoryCleaningService.java | Switched to DirectoryDeletingService and added snapshot chain waiting logic |
| integration-test/.../TestDirectoryDeletingServiceWithFSO.java | Updated service instantiation to include new filter arguments and replaced assertions with waitFor |
| integration-test/.../TestRandomKeyGenerator.java | Added --bucket-layout option and exposed bucket map methods for testing |
| integration-test/.../TestOzoneShellHA.java | Updated trash policy import and documentation to use OzoneTrashPolicy |
| common/src/main/java/org/apache/hadoop/ozone/om/OMConfigKeys.java | Marked snapshot directory service configs as @Deprecated |
| common/src/main/resources/ozone-default.xml | Updated default trash policy and deprecated tags for snapshot configs |
| server-scm/.../ContainerBalancer.java | Added new validation rule for move timeouts with detailed error message |
| server-scm/.../TestContainerBalancer.java | Enhanced test to validate new exception message and assert stopped status |
Comments suppressed due to low confidence (2)
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManager.java:310
- The Javadoc above `getPendingDeletionSubFiles` no longer mentions the new `filter` parameter. Update the comment to describe what the `filter` function does and when it is applied.
```java
DeleteKeysResult getPendingDeletionSubFiles(long volumeId, long bucketId, OmKeyInfo parentInfo, CheckedFunction<Table.KeyValue<String, OmKeyInfo>, Boolean, IOException> filter, long remainingBufLimit) throws IOException;
```
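To show how such a `filter` parameter is typically consumed, here is a simplified stand-in; the `CheckedFunction` interface and `gather` helper below are hypothetical, and a plain `String` key models the `Table.KeyValue<String, OmKeyInfo>` entry of the real signature:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Simplified model of a filter parameter: keep only the entries the filter
 *  accepts, analogous to gathering reclaimable sub-files. */
public final class SubFileFilterSketch {
    @FunctionalInterface
    interface CheckedFunction<T, R, E extends Exception> {
        R apply(T t) throws E;
    }

    static List<String> gather(List<String> entries,
            CheckedFunction<String, Boolean, IOException> filter) throws IOException {
        List<String> kept = new ArrayList<>();
        for (String entry : entries) {
            if (filter.apply(entry)) {
                kept.add(entry);
            }
        }
        return kept;
    }

    private SubFileFilterSketch() { }
}
```

A caller purging a whole directory could pass `entry -> true`, while the snapshot-aware path would pass a reclaimability check instead.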
hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotDirectoryCleaningService.java:247
- Missing import for `SnapshotUtils` causes a compilation error. Add `import org.apache.hadoop.ozone.om.snapshot.SnapshotUtils;` at the top of the test file.
SnapshotInfo nextSnapshot = SnapshotUtils.getNextSnapshot(cluster.getOzoneManager(), snapshotChainManager,
```java
          "should be greater than or equal to 540000ms or 9 minutes.",
          conf.getMoveTimeout().toMillis(),
          conf.getMoveReplicationTimeout().toMillis(),
          Duration.ofMillis(datanodeOffset).toMillis());
```
[nitpick] The use of `Duration.ofMillis(datanodeOffset).toMillis()` is redundant since `datanodeOffset` is already in milliseconds. You can simplify the error message formatting by using `datanodeOffset` directly.
```diff
-          Duration.ofMillis(datanodeOffset).toMillis());
+          datanodeOffset);
```
```java
    assertTrue(ex.getMessage().contains("should be greater than or equal to 540000ms or 9 minutes"),
        "Exception message should contain 'should be greater than or equal to 540000ms or 9 minutes'");
```
[nitpick] This assertion only checks for a substring in the exception message. Consider asserting the full message or using a more expressive assertion helper to guard against unintended changes in the text.
```diff
-    assertTrue(ex.getMessage().contains("should be greater than or equal to 540000ms or 9 minutes"),
-        "Exception message should contain 'should be greater than or equal to 540000ms or 9 minutes'");
+    String expectedMessage = "(hdds.container.balancer.move.timeout - hdds.container.balancer.move.replication.timeout " +
+        "- hdds.scm.replication.event.timeout.datanode.offset) should be greater than or equal to 9 minutes.";
+    assertEquals(expectedMessage, ex.getMessage(), "Exception message should match the expected message.");
```
```diff
     assertThat(dirDeletingService.getRunCount().get()).isGreaterThan(1);
     // Ensure dir deleting speed, here provide a backup value for safe CI
-    assertThat(elapsedRunCount).isGreaterThanOrEqualTo(7);
+    GenericTestUtils.waitFor(() -> dirDeletingService.getRunCount().get() - preRunCount >= 7, 1000, 100000);
```
[nitpick] The magic number 7 for minimum run count is unclear. Extract it into a named constant or add a comment explaining why at least 7 runs are expected to improve readability.
```diff
-    GenericTestUtils.waitFor(() -> dirDeletingService.getRunCount().get() - preRunCount >= 7, 1000, 100000);
+    GenericTestUtils.waitFor(() -> dirDeletingService.getRunCount().get() - preRunCount >= MINIMUM_RUN_COUNT_FOR_SAFE_CI, 1000, 100000);
```
Code Review
This pull request effectively refactors the DirectoryDeletingService to unify Active Object Store (AOS) and snapshot directory deletion logic using ReclaimableDirectoryFilter and ReclaimableKeyFilter. This is a significant improvement for code simplification and maintainability. The changes are comprehensive, touching configuration, core logic, documentation, and tests.
The introduction of parallel processing per store in DirectoryDeletingService and the new ContainerStateVerifier are valuable additions. The deprecation of old configurations and services is also handled consistently.
I've identified a few areas for potential improvement, mainly around code clarity, constant usage, and test timeouts. Addressing these would further enhance the quality of this contribution.
Summary of Findings
- Refactoring and Unification: The core refactoring of `DirectoryDeletingService` to handle both AOS and snapshot directory deletions is a significant improvement, centralizing logic and reducing redundancy.
- Configuration and Service Lifecycle: Old snapshot directory cleaning services and configurations are correctly deprecated and removed, aligning with the refactoring.
- Clarity and Maintainability: Some areas, particularly around complex logic in `DirectoryDeletingService` and the use of magic numbers in `ContainerBalancer`, could benefit from enhanced comments or constants for better long-term maintainability.
- Error Handling: The error handling in asynchronous operations within `DirectoryDeletingService` could be more specific to avoid masking critical system errors.
- Testing: Test suites have been broadly updated, which is good. Some test timeouts appear excessively long and could be reviewed.
Merge Readiness
This pull request introduces a substantial and beneficial refactoring by unifying the directory deletion logic. The changes are extensive and appear to be well-covered by test updates.
I've made a few suggestions of medium severity primarily aimed at improving code clarity, maintainability (e.g., using constants instead of magic numbers, clarifying constructor parameters), and refining error handling in asynchronous parts of the DirectoryDeletingService. Additionally, a very long timeout in one of the integration tests might warrant a review.
While these points are not critical blockers, addressing them would further strengthen the codebase. As I am an AI assistant, I cannot approve pull requests. This PR should be reviewed and approved by other maintainers. I recommend addressing the suggested changes before merging.
The new validation logic for move timeouts (lines 469-470) uses a hardcoded duration of 9 minutes and its millisecond equivalent 540000ms.
if ((conf.getMoveTimeout().toMillis() - conf.getMoveReplicationTimeout().toMillis() - datanodeOffset)
< Duration.ofMinutes(9).toMillis()) { // 9 minutes
String msg = String.format("(hdds.container.balancer.move.timeout (%sms) - " +
"hdds.container.balancer.move.replication.timeout (%sms) - " +
"hdds.scm.replication.event.timeout.datanode.offset (%sms)) " +
"should be greater than or equal to 540000ms or 9 minutes.", // 540000ms
conf.getMoveTimeout().toMillis(),
conf.getMoveReplicationTimeout().toMillis(),
    Duration.ofMillis(datanodeOffset).toMillis());

Could these values (9 minutes and 540000ms) be defined as named constants for better readability and maintainability? For example:
private static final Duration MIN_MOVE_TIMEOUT_DIFFERENCE = Duration.ofMinutes(9);
// ...
if ((conf.getMoveTimeout().toMillis() - conf.getMoveReplicationTimeout().toMillis() - datanodeOffset)
< MIN_MOVE_TIMEOUT_DIFFERENCE.toMillis()) {
String msg = String.format("... should be greater than or equal to %sms or %s minutes.",
MIN_MOVE_TIMEOUT_DIFFERENCE.toMillis(),
MIN_MOVE_TIMEOUT_DIFFERENCE.toMinutes());
// ...
}Additionally, on line 477, Duration.ofMillis(datanodeOffset).toMillis() is used. If datanodeOffset is already in milliseconds (as obtained from ozoneConfiguration.getTimeDuration with TimeUnit.MILLISECONDS), this conversion seems redundant. Could datanodeOffset be used directly in the String.format call here?
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
}, 2000, 100000000);
The timeout value 100000000 (milliseconds, which is ~27.7 hours) for GenericTestUtils.waitFor seems excessively large for an integration test.
}, 2000, 100000000);

Is such a long timeout necessary here? Could this potentially mask underlying issues or lead to very long test execution times if the condition is never met? Consider reducing this to a more typical timeout for integration tests (e.g., a few minutes) unless this specific scenario truly requires such an extended wait.
public DirectoryDeletingService(long interval, TimeUnit unit,
    long serviceTimeout, OzoneManager ozoneManager,
    OzoneConfiguration configuration, int dirDeletingServiceCorePoolSize, boolean deepCleanSnapshots) {
  super(DirectoryDeletingService.class.getSimpleName(), interval, unit,
      dirDeletingServiceCorePoolSize, serviceTimeout, ozoneManager, null);
  int limit = (int) configuration.getStorageSize(
      OMConfigKeys.OZONE_OM_RATIS_LOG_APPENDER_QUEUE_BYTE_LIMIT,
      OMConfigKeys.OZONE_OM_RATIS_LOG_APPENDER_QUEUE_BYTE_LIMIT_DEFAULT,
      StorageUnit.BYTES);
The DirectoryDeletingService constructor now includes a deepCleanSnapshots boolean parameter.
public DirectoryDeletingService(long interval, TimeUnit unit,
long serviceTimeout, OzoneManager ozoneManager,
OzoneConfiguration configuration, int dirDeletingServiceCorePoolSize, boolean deepCleanSnapshots) {
super(DirectoryDeletingService.class.getSimpleName(), interval, unit,
dirDeletingServiceCorePoolSize, serviceTimeout, ozoneManager, null);
// ...
this.deepCleanSnapshots = deepCleanSnapshots;
}

This flag appears to control whether the service processes directories within snapshots (as seen in the getTasks() method, lines 160-172). Could a Javadoc comment be added to the constructor or the class to clarify the role of deepCleanSnapshots and how it influences the service's behavior? This would enhance understanding for future maintenance.
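One possible shape for such documentation, shown on a hypothetical stand-alone holder class (the class and wording below are only a sketch of the suggestion, not the project's actual code):

```java
/**
 * Toy holder illustrating suggested Javadoc wording for the new flag.
 * Hypothetical class; the real field lives in DirectoryDeletingService.
 */
public class DeepCleanFlagSketch {
  /**
   * When true, this service also iterates the deletedDirectoryTable of
   * snapshots in the chain (snapshot deep clean); when false, it only
   * processes the active object store's deleted directories.
   */
  private final boolean deepCleanSnapshots;

  public DeepCleanFlagSketch(boolean deepCleanSnapshots) {
    this.deepCleanSnapshots = deepCleanSnapshots;
  }

  public boolean isDeepCleanSnapshots() {
    return deepCleanSnapshots;
  }

  public static void main(String[] args) {
    System.out.println(new DeepCleanFlagSketch(true).isDeepCleanSnapshots());
  }
}
```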
try {
  return processDeletedDirectories(currentSnapshotInfo, keyManager, dirSupplier, remainingBufLimit,
      expectedPreviousSnapshotId, exclusiveSizeMap, rnCnt);
} catch (Throwable e) {
In DirDeletingTask#processDeletedDirsForStore, the CompletableFuture.supplyAsync lambda catches a generic Throwable e:
} catch (Throwable e) { // Catches Throwable
return false;
}

While this ensures the future completes and returns false on error, catching Throwable can hide critical errors like OutOfMemoryError or other Errors that might indicate more severe system problems.
Would it be more appropriate to catch specific, expected exceptions (e.g., IOException, RuntimeException) that processDeletedDirectories might throw, and let other Errors propagate or be handled differently? This could aid in diagnosing more serious issues if they occur.
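A minimal self-contained sketch of that suggestion (hypothetical names; the real task calls processDeletedDirectories inside CompletableFuture.supplyAsync): catch only the expected exception types so an Error completes the future exceptionally instead of being silently converted to false.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;

public class ErrorHandlingSketch {
  // Stand-in for the per-store processing call; fails with a checked exception.
  static boolean processStore(boolean fail) throws IOException {
    if (fail) {
      throw new IOException("simulated store failure");
    }
    return true;
  }

  static CompletableFuture<Boolean> submit(boolean fail) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        return processStore(fail);
      } catch (IOException | RuntimeException e) {
        // Expected failures: report "no progress" for this store.
        return false;
      }
      // Errors (e.g. OutOfMemoryError) are deliberately NOT caught here;
      // they complete the future exceptionally instead of being masked.
    });
  }

  public static void main(String[] args) {
    System.out.println(submit(false).join()); // completes normally with true
    System.out.println(submit(true).join());  // expected failure maps to false
  }
}
```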
The logic for handling OMException (specifically VOLUME_NOT_FOUND or BUCKET_NOT_FOUND) within initializePreviousSnapshotsFromChain is as follows:
try {
bucketInfo = ozoneManager.getBucketManager().getBucketInfo(volume, bucket);
volumeId = ozoneManager.getMetadataManager().getVolumeId(volume);
} catch (OMException e) {
// If Volume or bucket has been deleted then all keys should be reclaimable as no snapshots would exist.
if (OMException.ResultCodes.VOLUME_NOT_FOUND == e.getResult() ||
OMException.ResultCodes.BUCKET_NOT_FOUND == e.getResult()) {
bucketInfo = null;
volumeId = null;
return; // Exits initialization early
}
throw e;
}

When a volume/bucket is not found, bucketInfo and volumeId are set to null, and the method returns. Subsequently, in the apply() method (line 201), bucketInfo == null is correctly treated as a condition for reclaimability.
However, if initializePreviousSnapshotsFromChain returns early due to VOLUME_NOT_FOUND or BUCKET_NOT_FOUND, the previousOmSnapshots and previousSnapshotInfos lists (which are cleared and repopulated in this method) might not be fully populated as intended for the numberOfPreviousSnapshotsFromChain.
Does the isReclaimable(keyValue) method, which might rely on these lists, correctly handle this state where bucketInfo is null but the snapshot-related lists might be empty or only partially initialized from a prior successful call for a different key?
A brief comment clarifying this interaction or ensuring these lists are reset to a consistent state (e.g., fully cleared if bucketInfo becomes null during initialization) could be beneficial for maintainability.
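A toy model of the "reset to a consistent state" option (hypothetical names and simplified types; the real state is the snapshot chain lists in the filter): clearing everything up front guarantees the early return can never expose lists populated by a previous key.

```java
import java.util.ArrayList;
import java.util.List;

public class ResetStateSketch {
  private final List<String> previousOmSnapshots = new ArrayList<>();
  private String bucketInfo;

  void initialize(boolean bucketExists) {
    // Clear first, so a "not found" early return never leaves stale
    // entries from a previous, successful initialization.
    previousOmSnapshots.clear();
    bucketInfo = null;
    if (!bucketExists) {
      return; // volume/bucket deleted: state is already consistent
    }
    bucketInfo = "bucket";
    previousOmSnapshots.add("snap1");
  }

  boolean isReclaimable() {
    // bucketInfo == null means the volume/bucket is gone: reclaimable.
    return bucketInfo == null || previousOmSnapshots.isEmpty();
  }

  // Helper: initialize twice in a row and report the final answer.
  static boolean reclaimableAfter(boolean firstExists, boolean secondExists) {
    ResetStateSketch s = new ResetStateSketch();
    s.initialize(firstExists);
    s.initialize(secondExists);
    return s.isReclaimable();
  }

  public static void main(String[] args) {
    System.out.println(reclaimableAfter(true, false)); // true: no leak from first call
    System.out.println(reclaimableAfter(false, true)); // false: bucket exists again
  }
}
```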
What changes were proposed in this pull request?
With the implementation of ReclaimableKeyFilter & ReclaimableDirectoryFilter, snapshot directory deep cleaning and the AOS DirectoryDeletingService can be unified, which simplifies the implementation rather than maintaining multiple variants of DirectoryDeletingService.
Algorithm in play:
The directory deleting service iterates through all the entries in the deletedDirectoryTable for each store (AOS or snapshot), with one thread processing the deleted directory entries of a particular store.
This thread initializes a DeletedDirectorySupplier, a thread-safe wrapper over the table iterator (we can eventually use a TableSpliterator after HDDS-12742. Make RDBStoreAbstractIterator Return Reference-Counted KeyValues apache/ozone#8203 gets merged).
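A minimal sketch of what such a thread-safe supplier could look like (hypothetical class, standing in for the real table iterator wrapper over RocksDB):

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

public class DeletedDirectorySupplierSketch {
  private final Iterator<String> it;

  DeletedDirectorySupplierSketch(Iterator<String> it) {
    this.it = it;
  }

  /** Workers call get(); synchronization makes hasNext/next one atomic step. */
  public synchronized Optional<String> get() {
    return it.hasNext() ? Optional.of(it.next()) : Optional.empty();
  }

  public static void main(String[] args) {
    DeletedDirectorySupplierSketch s =
        new DeletedDirectorySupplierSketch(List.of("dir1", "dir2").iterator());
    System.out.println(s.get().orElse("empty")); // dir1
    System.out.println(s.get().orElse("empty")); // dir2
    System.out.println(s.get().orElse("empty")); // empty
  }
}
```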
Each thread iterates through the deletedDirectoryTable independently and submits purgeDirectoriesRequests, where the logic for directoryPurge is as follows:
- Check whether the deletedDirectory entry from the deletedDirectoryTable can be reclaimed, i.e. whether a reference to the directory object still exists in the previous snapshot.
- If the directory exists in the previous snapshot, we need not expand the sub-files inside it (the number of files is generally far larger than the number of directories); we just expand all the subdirectories under the directory, which lets the next iteration recurse into the sub-files instead of traversing again from the root. Since a reference exists, we cannot purge the directory itself; instead we move all subdirectories present in the AOS directory table, plus the files that can be reclaimed (i.e. files not present in the previous snapshot's fileTable, found by iterating the sub-file list for the deleted directory prefix in the AOS file table), to the deleted space of the store [see HDDS-11244. OmPurgeDirectoriesRequest should clean up File and Directory tables of AOS for deleted snapshot directories apache/ozone#8509].
- If the directory doesn't exist in the previous snapshot, we remove all entries (sub-files & subdirectories) inside it, irrespective of whether they can be reclaimed, and purge the directory once all its entries have been moved from the AOS to that store's deleted space.
Once all threads have finished, if none of them moved any entry from the key space to the deleted space (all keys were already reclaimed in a previous run), we can update the exclusive size of the previous snapshot and the deep clean flag of the snapshot.
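The purge decision above can be sketched with plain collections (a toy model with hypothetical names; the real service evaluates reclaimability via ReclaimableDirectoryFilter/ReclaimableKeyFilter against RocksDB tables and submits purge requests):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PurgeSketch {
  // Stand-ins for the previous snapshot's directoryTable and fileTable.
  static final Set<String> PREV_SNAP_DIRS = Set.of("/vol/buck/dirA");
  static final Set<String> PREV_SNAP_FILES = Set.of("/vol/buck/dirA/f1");

  /** Decide what to do with one entry from the deletedDirectoryTable. */
  static String processDeletedDirectory(String dir, List<String> subDirs, List<String> subFiles) {
    if (PREV_SNAP_DIRS.contains(dir)) {
      // Referenced by the previous snapshot: cannot purge the directory.
      // Expand only the subdirectories for later iterations, and move the
      // sub-files NOT present in the previous snapshot's fileTable.
      List<String> reclaimableFiles = new ArrayList<>();
      for (String f : subFiles) {
        if (!PREV_SNAP_FILES.contains(f)) {
          reclaimableFiles.add(f);
        }
      }
      return "expand " + subDirs + ", move files " + reclaimableFiles;
    }
    // Not referenced: move every sub-entry and purge the directory itself.
    return "purge " + dir + " with " + subDirs + " and " + subFiles;
  }

  public static void main(String[] args) {
    System.out.println(processDeletedDirectory("/vol/buck/dirA",
        List.of("/vol/buck/dirA/sub1"),
        List.of("/vol/buck/dirA/f1", "/vol/buck/dirA/f2")));
    System.out.println(processDeletedDirectory("/vol/buck/dirB",
        List.of(), List.of("/vol/buck/dirB/f3")));
  }
}
```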
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13034
How was this patch tested?
Existing unit tests and additional assertions added on existing tests.