Skip to content

Conversation

@swamirishi
Copy link
Contributor

@swamirishi swamirishi commented May 17, 2025

What changes were proposed in this pull request?

Implementing a lock for this snapshot cache would prevent any newer threads from requesting a snapshot rocksdb handle from the snapshot cache. Thus any operation under this lock will have a consistent view of the entire snapshot. Creating a PR with an alternate approach which is lesser complex in implementation in comparison to #7745
The implementation uses are read lock whenever a handle to a rocksdb is requested. To acquire a complete cache lock write lock is taken and all rocksdb instances in the cache are closed. Till the write lock is held no thread would be able to open the rocksdb instance

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12134

How was this patch tested?

Adding unit tests.

swamirishi added 12 commits May 13, 2025 19:26
Change-Id: Id3202dbe7a5a71c43526e80af34b18dd7b2ed66e
Change-Id: I1712f2dd4138a1f679eb4e4676b3d5ac41c80583
Change-Id: Id4a53d3eb8f23031ef517b3793dc73b067b11d62
Change-Id: Ie08fad6a07a66c75f6d823a64a7d278d1bfb54bd
Change-Id: Ibaa9a25cfc9126593b9ec3ec913c4a09e0846258
Change-Id: If61d109c573652c586b739baae49980749a4f9fa
Change-Id: I2b6ea2084bbb0320bf35608002faccf52b466b66
Change-Id: I8a11ed9a9cccb70b0cc05b295c0e5289fc859215
Change-Id: If1f9d066e44fe3a324aff71278a0ca1858571166
…d of ReferenceCounted

Change-Id: Id2004bf073158466ef853c99fdd76fb2e141dcf8
Change-Id: Ia1e9509215539793691d416fb46a1d37ab28267b

# Conflicts:
#	hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/filter/AbstractReclaimableFilterTest.java
Change-Id: I908b0f9b33cce8468ab2d12ab758a5c91f36c559
Change-Id: If7e5de4db6dd58055623bf1b5dbea54b235a38f2
Change-Id: I14dab7c47c15919567739bb05a529073e3e169f6
@ivandika3 ivandika3 added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label May 17, 2025
Change-Id: I86f15d12cc3f2cd95d0587078d0abd3db47aac1e
Change-Id: Icd82e9e51aa8732c7f23cd9bbb6010719aa2b688
Change-Id: I533d58c79430f40b9169b04e24e27820456446ca
Change-Id: Iba25fc7cac35fa455c3970fe121d21af61c7d331

# Conflicts:
#	hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/lock/OzoneManagerLock.java
Change-Id: I2ba80411d73e2f44b212983986d0edd4547a7840
Change-Id: I2bd4e8134f51d601bad5e17f9ed5db9fafb61d24

# Conflicts:
#	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java
Change-Id: Ice8ddf3417dac002448cd1934259645af9b2acc1

# Conflicts:
#	hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/lock/IOzoneManagerLock.java
#	hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/lock/OzoneManagerLock.java
#	hadoop-ozone/common/src/test/java/org/apache/hadoop/ozone/om/lock/TestOzoneManagerLock.java
#	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/SstFilteringService.java
#	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/response/snapshot/OMSnapshotPurgeResponse.java
#	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java
#	hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/service/TestKeyDeletingService.java
Change-Id: I9e71122babf06d2101ea9d17b345ce3f07821c72
Change-Id: I93cf8ecee91616ba1956b0f04303d822fc0b4e79
@smengcl smengcl requested a review from Copilot May 29, 2025 15:10
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a snapshot cache lock mechanism to ensure consistent RocksDB handle views during OM bootstrap and related operations.

  • Introduce IOzoneManagerLock and pass it into SnapshotCache to guard reads and writes.
  • Refactor OzoneManagerLock/OmReadOnlyLock/IOzoneManagerLock to support flat‐resource write locks.
  • Update snapshot‐related code, responses, services, and tests to use the new SNAPSHOT_DB_LOCK and verify lock behavior.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
TestSnapshotDiffManager.java Inject OmReadOnlyLock into SnapshotCache in test setup
TestSnapshotCache.java Add parameterized tests for read/write lock behavior in SnapshotCache
SnapshotCache.java Add lock field, acquire/release read locks, and new write‐lock API (lock())
OMSnapshotPurgeResponse.java Replace leveled SNAPSHOT_LOCK with flat SNAPSHOT_DB_LOCK
SstFilteringService.java Update lock calls to use SNAPSHOT_DB_LOCK
OmSnapshotManager.java Pass metadata manager’s lock into SnapshotCache constructor
TestOzoneManagerLock.java Convert lock tests to parameterized and include flat‐resource scenarios
OzoneManagerLock.java Refactor locking internals, add acquireResourceWriteLock/releaseResourceWriteLock, and new flat resource
OmReadOnlyLock.java Implement no‐op flat‐resource write‐lock methods
IOzoneManagerLock.java Extend interface to declare resource‐level write‐lock methods
Comments suppressed due to low confidence (1)

hadoop-ozone/common/src/test/java/org/apache/hadoop/ozone/om/lock/TestOzoneManagerLock.java:289

  • The code uses Arrays::stream but java.util.Arrays is not imported, causing a compile error. Add import java.util.Arrays;.
for (Resource resource : Stream.of(LeveledResource.values(), FlatResource.values())

Comment on lines 374 to 375
SnapshotCache snapshotCache = new SnapshotCache(mockCacheLoader(), 10, omMetrics, 0,
new OmReadOnlyLock());
Copy link

Copilot AI May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using OmReadOnlyLock in this test will cause all get() calls to fail (lock always not acquired). Consider injecting a lock stub or the real OzoneManagerLock that grants read locks so the test can exercise cache behavior.

Suggested change
SnapshotCache snapshotCache = new SnapshotCache(mockCacheLoader(), 10, omMetrics, 0,
new OmReadOnlyLock());
OzoneManagerLock mockLock = mock(OzoneManagerLock.class);
when(mockLock.acquireReadLock(anyString())).thenReturn(true);
SnapshotCache snapshotCache = new SnapshotCache(mockCacheLoader(), 10, omMetrics, 0,
mockLock);

Copilot uses AI. Check for mistakes.
* are performed in a thread-safe manner. The returned supplier automatically releases the
* lock when closed, preventing potential resource contention or deadlocks.
*/
public UncheckedAutoCloseableSupplier<OMLockDetails> lock() {
Copy link

Copilot AI May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The method name lock() is very generic and may be confused with the lock field. Consider renaming it to acquireSnapshotDbWriteLock() or similar for clarity.

Copilot uses AI. Check for mistakes.
* If cache size exceeds soft limit, attempt to clean up and close the
instances that has zero reference count.
*/
@VisibleForTesting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like there's no test code that exercises this call so this annotation isn't needed and it can be made private.
Or we can add a test code for this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are already testcases for testing snapshot eviction

@VisibleForTesting
void cleanup() {
if (dbMap.size() > cacheSizeLimit) {
synchronized void cleanup(boolean force) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this method requires synchronized, this call is going to be expensive. should we make closing of OmSnapshot object asynchronous?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

closing snapshot is asynchronous already.

@smengcl
Copy link
Contributor

smengcl commented Jun 3, 2025

This one needs to be rebased

Change-Id: I10524fb31cbd435b33124e6324455f8bd4a9e675

# Conflicts:
#	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java
#	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java
#	hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotCache.java
#	hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotDiffManager.java
Change-Id: Ia604d3921be35f133f6f1758c26c78ab787d41a1
@swamirishi
Copy link
Contributor Author

This one needs to be rebased

Done

@smengcl smengcl marked this pull request as ready for review June 8, 2025 01:21
Copy link
Contributor

@smengcl smengcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @swamirishi . lgtm apart from the comment nit. Pending CI.

Change-Id: I7d310d0f07a3d81b6921f308ee95e9b9ddc713a7
@swamirishi
Copy link
Contributor Author

Thank you for reviewing the PR @jojochuang & @smengcl

@swamirishi swamirishi merged commit e936e4d into apache:master Jun 8, 2025
42 checks passed
@swamirishi
Copy link
Contributor Author

@sadanand48 We can use the snapshot cache api for the Bootstrap fixes. Please make the changes in #8477 or a follow up PR to make the bootstrap consistent.

aswinshakil added a commit to aswinshakil/ozone that referenced this pull request Jun 9, 2025
…239-container-reconciliation

Commits: 80 commits
5e273a4 HDDS-12977. Fail build on dependency problems (apache#8574)
5081ba2 HDDS-13034. Refactor DirectoryDeletingService to use ReclaimableDirFilter and ReclaimableKeyFilter (apache#8546)
e936e4d HDDS-12134. Implement Snapshot Cache lock for OM Bootstrap (apache#8474)
31d13de HDDS-13165. [Docs] Python client developer guide. (apache#8556)
9e6955e HDDS-13205. Bump common-custom-user-data-maven-extension to 2.0.3 (apache#8581)
750b629 HDDS-13203. Bump Bouncy Castle to 1.81 (apache#8580)
ba5177e HDDS-13202. Bump build-helper-maven-plugin to 3.6.1 (apache#8579)
07ee5dd HDDS-13204. Bump awssdk to 2.31.59 (apache#8582)
e1964f2 HDDS-13201. Bump jersey2 to 2.47 (apache#8578)
81295a5 HDDS-13013. [Snapshot] Add metrics and tests for snapshot operations. (apache#8436)
b3d75ab HDDS-12976. Clean up unused dependencies (apache#8521)
e0f08b2 HDDS-13179. rename-generated-config fails on re-compile without clean (apache#8569)
f388317 HDDS-12554. Support callback on completed reconfiguration (apache#8391)
c13a3fe HDDS-13154 Link more Grafana dashboard json files to the Observability user doc (apache#8533)
2a761f7 HDDS-11967. [Docs]DistCP Integration in Kerberized environment. (apache#8531)
81fc4c4 HDDS-12550. Use DatanodeID instead of UUID in NodeManager CommandQueue. (apache#8560)
2360af4 HDDS-13169. Intermittent failure in testSnapshotOperationsNotBlockedDuringCompaction (apache#8553)
f19789d HDDS-13170. Reclaimable filter should always reclaim entries when buckets and volumes have already been deleted (apache#8551)
315ef20 HDDS-13175. Leftover reference to OM-specific trash implementation (apache#8563)
902e715 HDDS-13159. Refactor KeyManagerImpl for getting deleted subdirectories and deleted subFiles (apache#8538)
46a93d0 HDDS-12817. Addendum rename ecIndex to replicaIndex in chunkinfo output (apache#8552)
19b9b9c HDDS-13166. Set pipeline ID in BlockExistenceVerifier to avoid cached pipeline with different node (apache#8549)
b3ff67c HDDS-13068. Validate Container Balancer move timeout and replication timeout configs (apache#8490)
7a7b9a8 HDDS-13139. Introduce bucket layout flag in freon rk command (apache#8539)
3c25e7d HDDS-12595. Add verifier for container replica states (apache#8422)
6d59220 HDDS-13104. Move auditparser acceptance test under debug (apache#8527)
8e8c432 HDDS-13071. Documentation for Container Replica Debugger Tool (apache#8485)
0e8c8d4 HDDS-13158. Bump junit to 5.13.0 (apache#8537)
8e552b4 HDDS-13157. Bump exec-maven-plugin to 3.5.1 (apache#8534)
168f690 HDDS-13155. Bump jline to 3.30.4 (apache#8535)
cc1e4d1 HDDS-13156. Bump awssdk to 2.31.54 (apache#8536)
3bfb7af HDDS-13136. KeyDeleting Service should not run for already deep cleaned snapshots (apache#8525)
006e691 HDDS-12503. Compact snapshot DB before evicting a snapshot out of cache (apache#8141)
568b228 HDDS-13067. Container Balancer delete commands should not be sent with an expiration time in the past (apache#8491)
53673c5 HDDS-11244. OmPurgeDirectoriesRequest should clean up File and Directory tables of AOS for deleted snapshot directories (apache#8509)
07f4868 HDDS-13099. ozone admin datanode list ignores --json flag when --id filter is used (apache#8500)
08c0ab8 HDDS-13075. Fix default value in description of container placement policy configs (apache#8511)
58c87a8 HDDS-12177. Set runtime scope where missing (apache#8513)
10c470d HDDS-12817. Add EC block index in the ozone debug replicas chunk-info (apache#8515)
7027ab7 HDDS-13124. Respect config hdds.datanode.use.datanode.hostname when reading from datanode (apache#8518)
b8b226c HDDS-12928. datanode min free space configuration (apache#8388)
fd3d70c HDDS-13026. KeyDeletingService should also delete RenameEntries (apache#8447)
4c1c6cf HDDS-12714. Create acceptance test framework for debug and repair tools (apache#8510)
fff80fc HDDS-13118. Remove duplicate mockito-core dependency from hdds-test-utils (apache#8508)
10d5555 HDDS-13115. Bump awssdk to 2.31.50 (apache#8505)
360d139 HDDS-13017. Fix warnings due to non-test scoped test dependencies (apache#8479)
1db1cca HDDS-13116. Bump jline to 3.30.3 (apache#8504)
322ca93 HDDS-13025. Refactor KeyDeletingService to use ReclaimableKeyFilter (apache#8450)
988b447 HDDS-5287. Document S3 ACL classes (apache#8501)
64bb29d HDDS-12777. Use module-specific name for generated config files (apache#8475)
54ed115 HDDS-9210. Update snapshot chain restore test to incorporate snapshot delete. (apache#8484)
87dfa5a HDDS-13014. Improve PrometheusMetricsSink#normalizeName performance (apache#8438)
7cdc865 HDDS-13100. ozone admin datanode list --json should output a newline at the end (apache#8499)
9cc4194 HDDS-13089. [snapshot] Add an integration test to verify snapshotted data can be read by S3 SDK client (apache#8495)
cb9867b HDDS-13065. Refactor SnapshotCache to return AutoCloseSupplier instead of ReferenceCounted (apache#8473)
a88ff71 HDDS-10979. Support STANDARD_IA S3 storage class to accept EC replication config (apache#8399)
6ec8f85 HDDS-13080. Improve delete metrics to show number of timeout DN command from SCM (apache#8497)
3bb8858 HDDS-12378. Change default hdds.scm.safemode.min.datanode to 3 (apache#8331)
0171bef HDDS-13073. Set pipeline ID in checksums verifier to avoid cached pipeline with different node (apache#8480)
5c7726a HDDS-11539. OzoneClientCache `@PreDestroy` is never called (apache#8493)
a8ed19b HDDS-13031. Implement a Flat Lock resource in OzoneManagerLock (apache#8446)
e9e8b30 HDDS-12935. Support unsigned chunked upload with STREAMING-UNSIGNED-PAYLOAD-TRAILER (apache#8366)
7590268 HDDS-13079. Improve logging in DN for delete operation. (apache#8489)
435fe7e HDDS-12870. Fix listObjects corner cases (apache#8307)
eb5dabd HDDS-12926. Remove *.tmp.* exclusion in DU (apache#8486)
eeb98c7 HDDS-13030. Snapshot Purge should unset deep cleaning flag for next 2 snapshots in the chain (apache#8451)
6bf121e HDDS-13032. Support proper S3OwnerId representation (apache#8478)
5d1b43d HDDS-13076. Refactor OzoneManagerLock class to rename Resource class to LeveledResource (apache#8482)
bafe6d9 HDDS-13064. [snapshot] Add test coverage for SnapshotUtils.isBlockLocationInfoSame() (apache#8476)
7035846 HDDS-13040. Add user doc highlighting the difference between Ozone ACL and S3 ACL. (apache#8457)
1825cdf HDDS-13049. Deprecate VolumeName & BucketName in OmKeyPurgeRequest and prevent Key version purge on Block Deletion Failure (apache#8463)
211c76c HDDS-13060. Change NodeManager.addDatanodeCommand(..) to use DatanodeID (apache#8471)
f410238 HDDS-13061. Add test for key ACL operations without permission (apache#8472)
d1a2f48 HDDS-13057. Increment block delete processed transaction counts regardless of log level (apache#8466)
0cc6fcc HDDS-13043. Replace != with assertNotEquals in TestSCMContainerPlacementRackAware (apache#8470)
e1c779a HDDS-13051. Use DatanodeID in server-scm. (apache#8465)
35e1126 HDDS-13042. [snapshot] Add future proofing test cases for unsupported file system API (apache#8458)
619c05d HDDS-13008. Exclude same SST files when calculating full snapdiff (apache#8423)
21b49d3 HDDS-12965. Fix warnings about "used undeclared" dependencies (apache#8468)
8136119 HDDS-13048. Create new module for Recon integration tests (apache#8464)

Conflicts:
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeManager.java
@adoroszlai
Copy link
Contributor

adoroszlai commented Jun 11, 2025

@swamirishi Have you noticed that this increased time for integration (flaky) by 25 minutes?

Most of the difference seems to be in TestSnapshotDeletingServiceIntegrationTest.

Before:

[ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 168.2 s <<< FAILURE! -- in org.apache.hadoop.ozone.om.snapshot.TestSnapshotDeletingServiceIntegrationTest
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 31.12 s -- in org.apache.hadoop.ozone.om.snapshot.TestSnapshotDeletingServiceIntegrationTest

After:

[ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 168.8 s <<< FAILURE! -- in org.apache.hadoop.ozone.om.snapshot.TestSnapshotDeletingServiceIntegrationTest
[ERROR] Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 449.1 s <<< FAILURE! -- in org.apache.hadoop.ozone.om.snapshot.TestSnapshotDeletingServiceIntegrationTest
[ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 157.8 s <<< FAILURE! -- in org.apache.hadoop.ozone.om.snapshot.TestSnapshotDeletingServiceIntegrationTest
[ERROR] Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 445.3 s <<< FAILURE! -- in org.apache.hadoop.ozone.om.snapshot.TestSnapshotDeletingServiceIntegrationTest
[ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 155.1 s <<< FAILURE! -- in org.apache.hadoop.ozone.om.snapshot.TestSnapshotDeletingServiceIntegrationTest
[ERROR] Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 441.3 s <<< FAILURE! -- in org.apache.hadoop.ozone.om.snapshot.TestSnapshotDeletingServiceIntegrationTest

Created HDDS-13244.

@swamirishi
Copy link
Contributor Author

TestSnapshotDeletingServiceIntegrationTest
Is this notice only after this got merged?

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jun 13, 2025
sadanand48 pushed a commit to sadanand48/hadoop-ozone that referenced this pull request Jun 16, 2025
swamirishi added a commit to swamirishi/ozone that referenced this pull request Dec 3, 2025
apache#8474)

(cherry picked from commit e936e4d)

 Conflicts:
	hadoop-ozone/common/src/test/java/org/apache/hadoop/ozone/om/lock/TestOzoneManagerLock.java
	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/response/snapshot/OMSnapshotPurgeResponse.java
	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java
	hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotCache.java
	hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotDiffManager.java

Change-Id: I9795bfd1d3d8e3fdfa3cdcfd6a571452196b8b3f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants