HDDS-13783. Implement locks for OmSnapshotLocalDataManager#9140
HDDS-13783. Implement locks for OmSnapshotLocalDataManager#9140swamirishi merged 83 commits intoapache:masterfrom
Conversation
Change-Id: Iba47aeb21663dfa407ab71339cef02c0d74b49f2
Change-Id: Ifd2feca1fddb144e4955db025f0b15a2ab1f3bfe
Change-Id: I34536ff06efb7d5a4942853f0fd83942ab398b5f
…otLocalDataManager Change-Id: I34536ff06efb7d5a4942853f0fd83942ab398b5f
Change-Id: I32bcaf2a1fb290f1790c02872a0230cd65586636
Change-Id: I105a2e8178c0444d52de41b99801f4ceb6d57ffd # Conflicts: # hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/util/Checksum.java # hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/util/ObjectSerializer.java # hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/util/YamlSerializer.java # hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotLocalDataYaml.java # hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/OmSnapshotLocalDataManager.java # hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/TestOmSnapshotLocalDataYaml.java
Change-Id: I985170e38fb8beeb784048e85a08a4c79e1aec97
Change-Id: I33e6e6e825bf23c323ad7ed593d800a11720fa4f
Change-Id: If30b2c766db82adde72145c8ecd3e590ef54cc2d
Change-Id: Id3f2c49050bc3476b9e0f5f51dacb6d9acc4c2f7
Change-Id: I432960725b4c6c55aa906b5780cc3027e41e10db
Change-Id: I3c5514e5bbd251a2b5297d8f074cfde5c71fa543
Change-Id: Ib5a9e6c91bdccba17820263c47eaf2c8400e930d
Change-Id: Ica36e0615c7bc6aa9b6a7f6fafafd0f830d4bafb
Change-Id: I26b66f266bb7677e4b1078f5fcd9f2ce3a651a70 # Conflicts: # hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/OmSnapshotLocalDataManager.java
Change-Id: I1d93dbc048a42cc55ff1f8ffa420e52f967527b8
Change-Id: I34202928a7a367dd0a1e57219317ff34de352b78
Change-Id: Iad6f26cb71ec921c51ee2d138745df1a2663533f
Change-Id: Ic5f7e249cfb9cb3973cbcd4abd36b22a6ff8f5aa
…calDataProvider Change-Id: I3a004b4b435075a4348960aeed642e8da71e7e72
Change-Id: I06990bc9ab8fc7e1eb7bec255646a650bd8c35fe
Change-Id: I4c6c61c83aa9fadab8ecef854b99dcc0a89a2208
Change-Id: I0e476322372a302572f1fe79cbf2e874bfeac2ed
Change-Id: I3849387d064e093634e69cdaf870d27c1934cda5 # Conflicts: # hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/util/ObjectSerializer.java # hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/util/YamlSerializer.java # hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotLocalData.java # hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotLocalDataYaml.java # hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java # hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/OmSnapshotLocalDataManager.java
Change-Id: Ie5e5f3dab4324103e8855dd15619d7755f0422e6
Change-Id: I55bd5c3ef7fc32910a9111328638de2edffcd541
There was a problem hiding this comment.
Pull Request Overview
Implements lock management for OmSnapshotLocalDataManager to avoid race conditions when reading/writing snapshot local YAML, and refactors APIs to return scoped providers that hold locks during use.
- Introduces Readable/Writable provider classes that acquire hierarchical locks per snapshot and resolve version mapping along the snapshot chain.
- Adds a fullLock (ReadWriteLock) for protecting in-memory graph/version metadata and atomically writes YAML via temp file + atomic move.
- Updates tests to validate lock acquisition/release order and version resolution/validation; adds new FlatResource.SNAPSHOT_LOCAL_DATA_LOCK.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/OmSnapshotLocalDataManager.java | Core changes: locking, provider classes, version resolution logic, atomic commit, and validation. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotLocalData.java | Changes API for addVersionSSTFileInfos to take LiveFileMetaData, adds removal method and setter on VersionMeta. |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/lock/FlatResource.java | Adds SNAPSHOT_LOCAL_DATA_LOCK for per-snapshot YAML operations. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/SnapshotDefragService.java | Uses Readable provider in try-with-resources for safe access. |
| hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestOmSnapshotLocalDataManager.java | Extensive new tests for lock ordering, version resolution/validation, and updated APIs. |
| hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/TestOmSnapshotManager.java | Ensures previous snapshot IDs are null for idempotent snapshot creation tests. |
| hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/TestOmSnapshotLocalDataYaml.java | Adjusts tests to provide LiveFileMetaData instead of SstFileInfo. |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| for (Map.Entry<Integer, LocalDataVersionNode> entry : previousVersionNodeMap.entrySet()) { | ||
| Set<LocalDataVersionNode> versionNode = localDataGraph.successors(entry.getValue()); | ||
| if (versionNode.size() > 1) { | ||
| throw new IOException(String.format("Snapshot %s version %d has multiple successors %s", | ||
| currentIteratedSnapshotId, entry.getValue(), versionNode)); | ||
| } | ||
| if (versionNode.isEmpty()) { | ||
| throw new IOException(String.format("Snapshot %s version %d doesn't have successor", | ||
| currentIteratedSnapshotId, entry.getValue())); | ||
| } | ||
| // Set the version node for iterated version to the successor corresponding to the previous snapshot id. | ||
| entry.setValue(versionNode.iterator().next()); |
There was a problem hiding this comment.
The graph is built with edges from a snapshot version to its previous version (current -> previous). To walk 'forward' along the chain from a previous snapshot to its dependent, you must use predecessors(), not successors(). Using successors() here will either return empty or the wrong node, breaking version resolution. Replace successors(...) with predecessors(...), and update the error messages accordingly.
| private WritableOmSnapshotLocalDataProvider(UUID snapshotId) throws IOException { | ||
| super(snapshotId, false); | ||
| fullLock.readLock().lock(); | ||
| } | ||
|
|
||
| private WritableOmSnapshotLocalDataProvider(UUID snapshotId, UUID snapshotIdToBeResolved) throws IOException { | ||
| super(snapshotId, false, null, snapshotIdToBeResolved, true); | ||
| fullLock.readLock().lock(); | ||
| } |
There was a problem hiding this comment.
WritableOmSnapshotLocalDataProvider acquires fullLock.readLock() while performing mutations (commit/upsert) to versionNodeMap/localDataGraph. This can race with readers and other writers. Use fullLock.writeLock() for writes and hold fullLock.readLock() in ReadableOmSnapshotLocalDataProvider during graph reads. Specifically: change readLock() to writeLock() in these constructors and in close(); and acquire/release fullLock.readLock() in ReadableOmSnapshotLocalDataProvider to protect versionNodeMap/localDataGraph access.
| @Override | ||
| public void close() throws IOException { | ||
| super.close(); | ||
| fullLock.readLock().unlock(); |
There was a problem hiding this comment.
WritableOmSnapshotLocalDataProvider acquires fullLock.readLock() while performing mutations (commit/upsert) to versionNodeMap/localDataGraph. This can race with readers and other writers. Use fullLock.writeLock() for writes and hold fullLock.readLock() in ReadableOmSnapshotLocalDataProvider during graph reads. Specifically: change readLock() to writeLock() in these constructors and in close(); and acquire/release fullLock.readLock() in ReadableOmSnapshotLocalDataProvider to protect versionNodeMap/localDataGraph access.
| fullLock.readLock().unlock(); | |
| fullLock.writeLock().unlock(); |
There was a problem hiding this comment.
We need a readLock here since we are already acquiring locks on individual snapshots. This lock is for race condition between bootstrap and other creates
...ne-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/OmSnapshotLocalDataManager.java
Show resolved
Hide resolved
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotLocalData.java
Show resolved
Hide resolved
| } | ||
|
|
||
| /** | ||
| * Intializer the snapshot local data by acquiring the lock on the snapshot and also acquires a read lock on the |
There was a problem hiding this comment.
Corrected spelling of 'Intializer' to 'Initializes'.
| * Intializer the snapshot local data by acquiring the lock on the snapshot and also acquires a read lock on the | |
| * Initializes the snapshot local data by acquiring the lock on the snapshot and also acquires a read lock on the |
| "key1", "key2")), 3); | ||
|
|
||
| IOException ex = assertThrows(IOException.class, omSnapshotLocalDataProvider::commit); | ||
| System.out.println(ex.getMessage()); |
There was a problem hiding this comment.
Avoid printing to stdout in unit tests; it adds noise to test logs. Please remove the System.out.println(...) line.
| System.out.println(ex.getMessage()); |
Change-Id: I165f00132548acf920b8fb9d7530a6314366797d
Change-Id: I3df543d896463f24ba3b69fce1b2f655af612dc6
Change-Id: I94cf4b82b2b620f480e2d1e01e6d94a6679d974e
Change-Id: I661e61e04031c1bcd537024e0a0859a6d6aeaffd
Change-Id: I59cab67d93f0359bca54c0c8119f5018167b7d1a
Change-Id: I950b4d0cc45a74369b8efa36deaf947db4cb35bc
Change-Id: Ib21e10e8fdf518928f238c9daae6672d679b7ae4
This reverts commit 48ec0bb.
Change-Id: I37d2f5c07f405f3069a4cb99881d5d1e67110e79
Change-Id: I119e8cede140c755d3cd09c0a56234ff4906be98
Change-Id: I6bbcdac1acb389ee3c08efb37676cccdb3d783c6
| @@ -75,6 +93,7 @@ public void computeAndSetChecksum(Yaml yaml, OmSnapshotLocalData data) throws IO | |||
| } | |||
| }; | |||
| this.versionNodeMap = new HashMap<>(); | |||
There was a problem hiding this comment.
Is ConcurrentHashMap better?
There was a problem hiding this comment.
We need not use ConcurrentHashMap since graph is not concurrent and can be only updated inside a synchrnized block. If you see both the Map and graph data structure only gets updated in the upsertNode function which is synchronized
There was a problem hiding this comment.
If we don't use ConcurrentHashMap, all read path also need to be in synchronized, otherwise other threads may not see the write result.
There was a problem hiding this comment.
why so? We are always acquiring a readLock or write lock before accessing the map. We cannot access his in a thread safe way anyhow without acquiring lock
There was a problem hiding this comment.
We can read in parallel when there is no write lock involved
There was a problem hiding this comment.
locking is good but lock alone may not be sufficient.
see Java memory model. related: https://stackoverflow.com/a/15794355
There was a problem hiding this comment.
Yeah I have synchornized reads and writes on concurrentHashMap and localDataGraph using an internalReadWriteLock
Change-Id: I36c48585d17042491f7dc61a84845c999802d1fa
| private HierarchicalResourceLockManager locks; | ||
|
|
||
| public OmSnapshotLocalDataManager(OMMetadataManager omMetadataManager) throws IOException { | ||
| this.localDataGraph = GraphBuilder.directed().build(); |
There was a problem hiding this comment.
The same applies to localDataGraph
There was a problem hiding this comment.
Unfortunately there is no concurrent graph data structure implementation library. We will have to implement one I feel this might be an overkill
There was a problem hiding this comment.
I agree.
Just make sure all r/w to this graph are in synchronized then we should be good.
There was a problem hiding this comment.
I have made all writes and reads synchronized by using a readWrite lock
Change-Id: I2f7cb5ec772d2d2de99e8d3cba8bdc34e3f36efd
Change-Id: Ida628ebab658b2a31fe5d654f19035e96f3163ac
Change-Id: I960fd1d25e0f323d56f5e88ffb03261d777d9021
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
|
|
||
| /** | ||
| * Intializes the snapshot local data by acquiring the lock on the snapshot and also acquires a read lock on the |
There was a problem hiding this comment.
Corrected spelling of 'Intializes' to 'Initializes'.
| * Intializes the snapshot local data by acquiring the lock on the snapshot and also acquires a read lock on the | |
| * Initializes the snapshot local data by acquiring the lock on the snapshot and also acquires a read lock on the |
| * Depending on read or write locks are acquired on the snapshotId and read lock is acquired on the previous | ||
| * snapshot. Once the instance is closed the read lock on previous snapshot is released followed by releasing the | ||
| * lock on the snapshotId. | ||
| * @param read |
There was a problem hiding this comment.
| if (versionNode != null && localDataGraph.inDegree(versionNode) != 0) { | ||
| Set<LocalDataVersionNode> versionNodes = localDataGraph.predecessors(versionNode); | ||
| throw new IOException(String.format("Cannot remove Snapshot %s with version : %d since it still has " + | ||
| "predecessors : %s", snapshotId, version, versionNodes)); |
There was a problem hiding this comment.
The error message uses 'predecessors' when referring to nodes that depend on this version. Based on the logic checking inDegree, these are actually successors (dependents). Consider changing 'predecessors' to 'successors' or 'dependents' for clarity.
| "predecessors : %s", snapshotId, version, versionNodes)); | |
| "successors : %s", snapshotId, version, versionNodes)); |
| @@ -274,7 +279,7 @@ public OmSnapshotLocalData copyObject() { | |||
| * maintain immutability. | |||
| */ | |||
| public static class VersionMeta implements CopyObject<VersionMeta> { | |||
There was a problem hiding this comment.
The field previousSnapshotVersion is changed from final to mutable without a clear design comment explaining why mutability is required. Consider adding a comment explaining the use case for mutation (appears to be for version resolution during lock acquisition).
| public static class VersionMeta implements CopyObject<VersionMeta> { | |
| public static class VersionMeta implements CopyObject<VersionMeta> { | |
| /** | |
| * The version of the previous snapshot. This field is mutable to allow | |
| * version resolution during lock acquisition and other update scenarios. | |
| * Although the class is generally intended to be immutable, mutability | |
| * of this field is required for certain internal operations where the | |
| * previous snapshot version may need to be updated after object creation. | |
| */ |
...ne-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/OmSnapshotLocalDataManager.java
Show resolved
Hide resolved
Change-Id: I1bb9832e7e8c40deeccb9d0868eaf5772f39b7f9
|
Thank you @smengcl & @jojochuang for reviewing the PR |
|
@swamirishi please set fix version when resolving Jira issue after PR merge |
What changes were proposed in this pull request?
Locks need to be taken while updating the LocalSnapshotData yaml file which is important to avoid race conditions with other threads reading the yaml contents.
This depends on #9159 #9160
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13783
How was this patch tested?
Adding additional unit tests