-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-13035. SnapshotDeletingService should hold write locks while purging deleted snapshots #8554
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…s and deleted subFiles Change-Id: Ic1fc709b3963cde14c2a7fb64b687322a29e642a
Change-Id: I47b24dfc3b5afa3cefbdc85ac7b3e4a9b8c94869
…ectoryFilter & ReclaimableKeyFilter Change-Id: Iffdda403cba914ddef2b75808cfbef1a72b9a2d3
Change-Id: I11acc3782aadf8393f731adcaa2a436dd9b534ae
…ractDeletingService Change-Id: I11ad5f48e25a7d22676a061bf43d8d168f0ae683
…kets and volumes have already been deleted Change-Id: I16dc9d8f00686320b4e98fa5691420294a7f1e2f
Change-Id: Ie5fd1406bbb8af3a9ba76440dcba9b8d8db14691
Change-Id: I61ef68263ff88daa0e53dfb9d7d8eb62495d226b
Change-Id: I2667c6d12523f4dee7cbcf7c48c93803fe84d3d4
Change-Id: Ib517326c453a156ca9c81cb2c856238d4bc19e69 # Conflicts: # hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/DirectoryDeletingService.java
Change-Id: Iba8e6ad3fd3e9b997bcaaf1a80f1ea92ae59b6f9
Change-Id: I2e5e5a95079dbda192b5305f18e4b29b0eeb620d
Change-Id: I5ed93af3b5ae794b0cfe4671ec2a851592edcb8c
Change-Id: I587f16ede2b787cd73c7f9a11c368a372be07ed7
Change-Id: Ifda5cd552ad23086cec7163d2a7983ad1dd5f5c4
…urging deleted snapshots Change-Id: I7f0c12c53f8838e652624bb993a6c8414ef638c7
Change-Id: I4c234b75208d96a146c7498bb6c2c188a270e1b8 # Conflicts: # hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java # hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/filter/AbstractReclaimableFilterTest.java
Change-Id: I92114f919459d2f4ca716dd0f983ae6a23db85c1
Change-Id: I87ee106c43c88cc875d8fb39ac29c7b06d797394
Change-Id: Iac3af98a7e568a135073b6704a6ad5a5fac7b427
Change-Id: Idd2afd9625c7bd811f32e4e307d18e604d5a2583
Change-Id: I9b7b41cf667e03d48120a4201757e445227924f7
...st/src/test/java/org/apache/hadoop/ozone/om/service/TestDirectoryDeletingServiceWithFSO.java
Outdated
Show resolved
Hide resolved
Change-Id: I2ff1cf3ecf3baa00a5c5646901f6c9ffdbe6e370
Change-Id: I28997dbafe0f6eba2fc02d00310e55e7c33b0dda
jojochuang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow my review comments were submitted... here it is.
| // Suspend KeyDeletingService | ||
| dirDeletingService.suspend(); | ||
| GenericTestUtils.waitFor(() -> !dirDeletingService.isRunningOnAOS(), 1000, 10000); | ||
| Thread.sleep(1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replacing conditional wait with a fixed sleep time can make the test brittle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I fixed the suspend function to wait for the existing runs we wouldn't require the Thread.sleep() anymore
| } catch (IOException e) { | ||
| throw new RuntimeException(e); | ||
| } | ||
| }, 2000, 100000000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems too much
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah fixed it in
#8546
| conf.setTimeDuration(OZONE_SNAPSHOT_DELETING_SERVICE_TIMEOUT, | ||
| 10000, TimeUnit.MILLISECONDS); | ||
| conf.setInt(OMConfigKeys.OZONE_SNAPSHOT_DIRECTORY_SERVICE_INTERVAL, 500); | ||
| 10, TimeUnit.MILLISECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it work reducing the time interval from 10 seconds to 10 milliseconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah we should reduce it from 10 secs to a smaller value so that the test doesn't time out. I changed it to 500 ms so that it runs along the same lines as KDS & DDS
Change-Id: I05017c54665f3ed50fcb1827479a6de3c99e31b1 # Conflicts: # hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/DirectoryDeletingService.java
Change-Id: I1abad0f11b09d8aff0bf5ab0e7619efe29b6d0fc
Change-Id: Ia906d1ef857143acd2b336f561ed25ee6e4ad451
Change-Id: I232f2c4915b1ed063729a4758588abc98569109d
Change-Id: Ibcc5f5d3893b532894133298f0c41f967cbdf287
Change-Id: If0cd16b56a5e3caeb28f3c6102d6a42c8309f548 # Conflicts: # hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/service/TestDirectoryDeletingServiceWithFSO.java # hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/AbstractKeyDeletingService.java # hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/DirectoryDeletingService.java
Change-Id: If3b294f9501cd8fb2c0a728e03013dcd911edc58
Change-Id: I3152bf8948909e0e34bcf3fc7558b9e0b2f81565
Change-Id: Ie5dbf45f2e4029d02fcb1df1a633f403911958ae # Conflicts: # hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/DirectoryDeletingService.java # hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/KeyDeletingService.java
Change-Id: If02e05b845f4e39cde18636c8d78573beb5b32a4
|
@SaketaChalamchala please take a look at this patch |
| lockIds.add(nextSnapshot.getSnapshotId()); | ||
| } | ||
| // Acquire write lock on current snapshot and next snapshot in chain. | ||
| if (!snapshotIdLocks.acquireLock(lockIds).isLockAcquired()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So basically MultiSnapshotLocks replaced AtomicBoolean isRunningOnAOS right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah
| // nextSnapshot = null means entries would be moved to AOS. | ||
| if (nextSnapshot == null) { | ||
| LOG.info("Snapshot: {} entries will be moved to AOS.", snapInfo.getTableKey()); | ||
| waitForKeyDeletingService(); | ||
| waitForDirDeletingService(); | ||
| } else { | ||
| LOG.info("Snapshot: {} entries will be moved to next active snapshot: {}", | ||
| snapInfo.getTableKey(), nextSnapshot.getTableKey()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we print this message before the lock? It makes it easier to debug later if things happen in my opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah makes sense. let me make the change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Filename
TestSnapshotDeletingServiceIntegrationTest.java
looks redundant to me because this is already in the integration test path. Could just be:
TestSnapshotDeletingService.java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot rename as there is already a Test class with the name TestSnapshotDeletingService
|
Pls resolve the code conflicts and nits and I think this is good to go. |
Change-Id: I1bba69a699341e885c0093fe81685395eae70144 # Conflicts: # hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/service/TestSnapshotDeletingServiceIntegrationTest.java
Change-Id: If6942f6e88d96103ecdfd43f54d1502b8429f0ff
Change-Id: I586687c65af1a19b10a32738ac8dceef6066eddf
|
Thanks for reviewing the PR @smengcl & @jojochuang |
…ging deleted snapshots (apache#8554)
…ging deleted snapshots (apache#8554)
…239-container-reconciliation Commits: 62 da53b5b HDDS-13299. Fix failures related to delete (apache#8665) 8c1b439 HDDS-13296. Integration check always passes due to missing output (apache#8662) 7329859 HDDS-13023. Container checksum is missing after container import (apache#8459) a0af93e HDDS-13292. Change `<? extends KeyValue>` to `<KeyValue>` in test (apache#8657) f3050cf HDDS-13276. Use KEY_ONLY/VALUE_ONLY iterator in SCM/Datanode. (apache#8638) e9c0a45 HDDS-13262. Simplify key name validation (apache#8619) f713e57 HDDS-12482. Avoid using CommonConfigurationKeys (apache#8647) b574709 HDDS-12924. datanode used space calculation optimization (apache#8365) de683aa HDDS-13263. Refactor DB Checkpoint Utilities. (apache#8620) 97262aa HDDS-13256. Updated OM Snapshot Grafana Dashboard to reflect metric updates from HDDS-13181. (apache#8639) 9d2b415 HDDS-13234. Expired secret key can abort leader OM startup. (apache#8601) d9049a2 HDDS-13220. Change Recon 'Negative usedBytes' message loglevel to DEBUG (apache#8648) 6df3077 HDDS-9223. Use protobuf for SnapshotDiffJobCodec (apache#8503) a7fc290 HDDS-13236. Change Table methods not to throw IOException. (apache#8645) 9958f5b HDDS-13287. Upgrade commons-beanutils to 1.11.0 due to CVE-2025-48734 (apache#8646) 48aefea HDDS-13277. [Docs] Native C/C++ Ozone clients (apache#8630) 052d912 HDDS-13037. Let container create command support STANDALONE , RATIS and EC containers (apache#8559) 90ed60b HDDS-13279. Skip verifying Apache Ranger binaries in CI (apache#8633) 9bc53b2 HDDS-11513. All deletion configurations should be configurable without restart (apache#8003) ac511ac HDDS-13259. Deletion Progress - Grafana Dashboard (apache#8617) 3370f42 HDDS-13246. Change `<? extend KeyValue>` to `<KeyValue>` in hadoop-hdds (apache#8631) 7af8c44 HDDS-11454. Ranger integration for Docker Compose environment (apache#8575) 5a3e4e7 HDDS-13273. Bump awssdk to 2.31.63 (apache#8626) 77138b8 HDDS-13254. Change table iterator to optionally read key or value. (apache#8621) ce288b6 HDDS-13265. Simplify the page Access Ozone using HTTPFS REST API (apache#8629) 36fe888 HDDS-13275. Improve CheckNative implementation (apache#8628) d38484e HDDS-13274. Bump sqlite-jdbc to 3.50.1.0 (apache#8627) 3f3ec43 HDDS-13266. `ozone debug checknative` to show OpenSSL lib (apache#8623) 8983a63 HDDS-13272. Bump junit to 5.13.1 (apache#8625) a927113 HDDS-13271. [Docs] Minor text updates, reference links. (apache#8624) 7e77058 HDDS-13112. [Docs] OM Bootstrap can also happen when follower falls behind too much. (apache#8600) fd13300 HDDS-10775. Support bucket ownership verification (apache#8558) 3ecf345 HDDS-13207. [Docs] Third party systems compatible with Ozone S3. (apache#8584) ad5a507 HDDS-13035. SnapshotDeletingService should hold write locks while purging deleted snapshots (apache#8554) 38a9186 HDDS-12637. Increase max buffer size for tar entry read/write (apache#8618) f31c264 HDDS-13045. Implement Immediate Triggering of Heartbeat when Volume Full (apache#8590) 0701d6a HDDS-13248. Remove `ozone debug replicas verify` option --output-dir (apache#8612) ca1afe8 HDDS-13257. Remove separate split for shell integration tests (apache#8616) 5d6fe94 HDDS-13216. Standardize Container[Replica]NotFoundException messages (apache#8599) 1e47217 HDDS-13168. Fix error response format in CheckUploadContentTypeFilter (apache#8614) 6d4d423 HDDS-13181. Added metrics for internal Snapshot Operations. (apache#8606) 4a461b2 HDDS-10490. Intermittent NPE in TestSnapshotDiffManager#testLoadJobsOnStartUp (apache#8596) bf29f7f HDDS-13235. The equals/hashCode methods in anonymous KeyValue classes may not work. (apache#8607) 6ff3ad6 HDDS-12873. Improve ContainerData statistics synchronization. (apache#8305) 09d3b27 HDDS-13244. TestSnapshotDeletingServiceIntegrationTest should close snapshots after deleting them (apache#8611) 931bc2d HDDS-13243. copy-rename-maven-plugin version is missing (apache#8605) 3b5985c HDDS-13244. Disable TestSnapshotDeletingServiceIntegrationTest 6bf009c HDDS-12927. metrics and log to indicate datanode crossing disk limits (apache#8573) 752da2b HDDS-12760. Intermittent Timeout in testImportedContainerIsClosed (apache#8349) 8c32363 HDDS-13050. Update StartFromDockerHub.md. (apache#8586) ba1887c HDDS-13241. Fix some potential resource leaks (apache#8602) bbaf71e HDDS-13130. Rename all instances of Disk Usage to Namespace usage (apache#8571) 0628386 HDDS-13142. Correct SCMPerformanceMetrics for delete operation. (apache#8592) 516bc96 HDDS-13148. [Docs] Update Transparent Data Encryption doc. (apache#8530) 5787135 HDDS-13229. [Doc] Fix incorrect CLI argument order in OM upgrade docs (apache#8598) ba95074 HDDS-13107. Support limiting output of `ozone admin datanode list` (apache#8595) e7f5544 HDDS-13171. Replace pipelineID if nodes are changed (apache#8562) 3c9d4d8 HDDS-13103. Correct transaction metrics in SCMBlockDeletingService. (apache#8516) f62eb8a HDDS-13160. Remove SnapshotDirectoryCleaningService and refactor AbstractDeletingService (apache#8547) b46e6b2 HDDS-13150. Fixed SnapshotLimitCheck when failures occur. (apache#8532) 203c1d3 HDDS-13206. Update documentation for Apache Ranger (apache#8583) 2072ef0 HDDS-13214. populate-cache fails due to unused dependency (apache#8594) Conflicts: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerData.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainer.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/helpers/KeyValueContainerUtil.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/statemachine/background/BlockDeletingTask.java
…ks while purging deleted snapshots (apache#8554) (cherry picked from commit ad5a507) Conflicts: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/service/TestSnapshotDeletingServiceIntegrationTest.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/DirectoryDeletingService.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/SnapshotDeletingService.java Change-Id: I288073796e0aa7e0a38995c2930fc638fffe5c9b
What changes were proposed in this pull request?
Currently SnapshotDeletingService waits for KeyDeletingService & DirectoryDeletingService to finish before it starts to move entries from the deleted snapshot to the next snapshot in the chain. With ReclaimableFilter implementation both KeyDeletingService & DirectoryDeletingService are going to hold read locks on the previous snapshot in the chain. Now before processing a deleted snapshot the snapshot deleting service can just acquire write locks on the deleted snapshot and next snapshot in the chain which would prevent any parallel instance of KeyDeletingService or DirectoryDeletingService running on an inconsistent state of the snapshot.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13035
How was this patch tested?
Additional unit tests and integration tests