HDDS-13783. Implement locks for OmSnapshotLocalDataManager#9140

Merged
swamirishi merged 83 commits into apache:master from swamirishi:HDDS-13783
Oct 29, 2025

@swamirishi swamirishi commented Oct 12, 2025

What changes were proposed in this pull request?

Locks must be taken while updating the LocalSnapshotData YAML file to avoid race conditions with other threads reading the YAML contents.
This depends on #9159 and #9160.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-13783

How was this patch tested?

Added additional unit tests.

Change-Id: Iba47aeb21663dfa407ab71339cef02c0d74b49f2
Change-Id: Ifd2feca1fddb144e4955db025f0b15a2ab1f3bfe
Change-Id: I34536ff06efb7d5a4942853f0fd83942ab398b5f
…otLocalDataManager

Change-Id: I34536ff06efb7d5a4942853f0fd83942ab398b5f
Change-Id: I32bcaf2a1fb290f1790c02872a0230cd65586636
Change-Id: I105a2e8178c0444d52de41b99801f4ceb6d57ffd

# Conflicts:
#	hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/util/Checksum.java
#	hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/util/ObjectSerializer.java
#	hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/util/YamlSerializer.java
#	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotLocalDataYaml.java
#	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/OmSnapshotLocalDataManager.java
#	hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/TestOmSnapshotLocalDataYaml.java
Change-Id: I985170e38fb8beeb784048e85a08a4c79e1aec97
Change-Id: I33e6e6e825bf23c323ad7ed593d800a11720fa4f
Change-Id: If30b2c766db82adde72145c8ecd3e590ef54cc2d
Change-Id: Id3f2c49050bc3476b9e0f5f51dacb6d9acc4c2f7
Change-Id: I432960725b4c6c55aa906b5780cc3027e41e10db
@swamirishi swamirishi changed the title from "HDDS-13783." to "HDDS-13783. Implement locks for OmSnapshotLocalDataManager" Oct 12, 2025
Change-Id: I3c5514e5bbd251a2b5297d8f074cfde5c71fa543
Change-Id: Ib5a9e6c91bdccba17820263c47eaf2c8400e930d
Change-Id: Ica36e0615c7bc6aa9b6a7f6fafafd0f830d4bafb
Change-Id: I26b66f266bb7677e4b1078f5fcd9f2ce3a651a70

# Conflicts:
#	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/OmSnapshotLocalDataManager.java
Change-Id: I1d93dbc048a42cc55ff1f8ffa420e52f967527b8
Change-Id: I34202928a7a367dd0a1e57219317ff34de352b78
Change-Id: Iad6f26cb71ec921c51ee2d138745df1a2663533f
Change-Id: Ic5f7e249cfb9cb3973cbcd4abd36b22a6ff8f5aa
…calDataProvider

Change-Id: I3a004b4b435075a4348960aeed642e8da71e7e72
Change-Id: I06990bc9ab8fc7e1eb7bec255646a650bd8c35fe
Change-Id: I4c6c61c83aa9fadab8ecef854b99dcc0a89a2208
@swamirishi swamirishi added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Oct 13, 2025
Change-Id: I0e476322372a302572f1fe79cbf2e874bfeac2ed
Change-Id: I3849387d064e093634e69cdaf870d27c1934cda5

# Conflicts:
#	hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/util/ObjectSerializer.java
#	hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/util/YamlSerializer.java
#	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotLocalData.java
#	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotLocalDataYaml.java
#	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java
#	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/OmSnapshotLocalDataManager.java
Change-Id: Ie5e5f3dab4324103e8855dd15619d7755f0422e6
Change-Id: I55bd5c3ef7fc32910a9111328638de2edffcd541
@jojochuang jojochuang requested a review from Copilot October 17, 2025 15:24
Copilot AI left a comment

Pull Request Overview

Implements lock management for OmSnapshotLocalDataManager to avoid race conditions when reading/writing snapshot local YAML, and refactors APIs to return scoped providers that hold locks during use.

  • Introduces Readable/Writable provider classes that acquire hierarchical locks per snapshot and resolve version mapping along the snapshot chain.
  • Adds a fullLock (ReadWriteLock) for protecting in-memory graph/version metadata and atomically writes YAML via temp file + atomic move.
  • Updates tests to validate lock acquisition/release order and version resolution/validation; adds new FlatResource.SNAPSHOT_LOCAL_DATA_LOCK.
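The "temp file + atomic move" step in the overview can be sketched with the standard `java.nio.file` API. This is a minimal illustration, not the PR's code: the class name `AtomicYamlWriteSketch`, the method `writeAtomically`, and the file names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: names are illustrative and not taken from the PR.
class AtomicYamlWriteSketch {

  // Writes the content to a temp file in the same directory, then renames it
  // over the target so readers see either the old file or the new one.
  static void writeAtomically(Path target, String content) throws IOException {
    Path dir = target.toAbsolutePath().getParent();
    Path tmp = Files.createTempFile(dir, target.getFileName().toString(), ".tmp");
    try {
      Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
      // With ATOMIC_MOVE, replacing an existing target is implementation
      // specific; on POSIX file systems this is typically a rename.
      Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    } finally {
      // No-op after a successful move; cleans up the temp file on failure.
      Files.deleteIfExists(tmp);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("snapshot-local-data");
    Path yaml = dir.resolve("snapshot.yaml");
    writeAtomically(yaml, "version: 1\n");
    writeAtomically(yaml, "version: 2\n");
    System.out.print(new String(Files.readAllBytes(yaml), StandardCharsets.UTF_8));
  }
}
```

Creating the temp file in the target's own directory keeps the move on one file system, which is what makes an atomic rename possible in the first place.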

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/OmSnapshotLocalDataManager.java | Core changes: locking, provider classes, version resolution logic, atomic commit, and validation. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotLocalData.java | Changes API for addVersionSSTFileInfos to take LiveFileMetaData, adds removal method and setter on VersionMeta. |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/lock/FlatResource.java | Adds SNAPSHOT_LOCAL_DATA_LOCK for per-snapshot YAML operations. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/SnapshotDefragService.java | Uses Readable provider in try-with-resources for safe access. |
| hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestOmSnapshotLocalDataManager.java | Extensive new tests for lock ordering, version resolution/validation, and updated APIs. |
| hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/TestOmSnapshotManager.java | Ensures previous snapshot IDs are null for idempotent snapshot creation tests. |
| hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/TestOmSnapshotLocalDataYaml.java | Adjusts tests to provide LiveFileMetaData instead of SstFileInfo. |
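
The scoped-provider idea summarized above (a provider that holds its locks for as long as it is open, used in try-with-resources) might look roughly like the following. This is a sketch under assumptions: `ScopedProviderSketch`, `ReadableProvider`, and the plain `ReentrantReadWriteLock` instances are hypothetical stand-ins for the PR's provider classes and its hierarchical lock manager. The acquire and release order follows the Javadoc quoted later in this review (lock the snapshot, then the previous snapshot; release in reverse).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: the real provider classes live in OmSnapshotLocalDataManager.
class ScopedProviderSketch {

  // Records lock events so the ordering can be inspected.
  static final List<String> EVENTS = new ArrayList<>();

  static final class ReadableProvider implements AutoCloseable {
    private final ReentrantReadWriteLock snapshotLock;
    private final ReentrantReadWriteLock previousLock;

    ReadableProvider(ReentrantReadWriteLock snapshotLock, ReentrantReadWriteLock previousLock) {
      this.snapshotLock = snapshotLock;
      this.previousLock = previousLock;
      // Acquire the snapshot's own lock first, then the previous snapshot's read lock.
      snapshotLock.readLock().lock();
      EVENTS.add("lock snapshot");
      previousLock.readLock().lock();
      EVENTS.add("lock previous");
    }

    String read() {
      return "local data";
    }

    @Override
    public void close() {
      // Release in reverse order of acquisition.
      previousLock.readLock().unlock();
      EVENTS.add("unlock previous");
      snapshotLock.readLock().unlock();
      EVENTS.add("unlock snapshot");
    }
  }

  public static void main(String[] args) {
    ReentrantReadWriteLock snapshot = new ReentrantReadWriteLock();
    ReentrantReadWriteLock previous = new ReentrantReadWriteLock();
    try (ReadableProvider provider = new ReadableProvider(snapshot, previous)) {
      System.out.println(provider.read());
    }
    System.out.println(EVENTS);
  }
}
```

Because the locks are released in `close()`, callers cannot forget to unlock as long as they use try-with-resources.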

Comment on lines 487 to 498

    for (Map.Entry<Integer, LocalDataVersionNode> entry : previousVersionNodeMap.entrySet()) {
      Set<LocalDataVersionNode> versionNode = localDataGraph.successors(entry.getValue());
      if (versionNode.size() > 1) {
        throw new IOException(String.format("Snapshot %s version %d has multiple successors %s",
            currentIteratedSnapshotId, entry.getValue(), versionNode));
      }
      if (versionNode.isEmpty()) {
        throw new IOException(String.format("Snapshot %s version %d doesn't have successor",
            currentIteratedSnapshotId, entry.getValue()));
      }
      // Set the version node for iterated version to the successor corresponding to the previous snapshot id.
      entry.setValue(versionNode.iterator().next());

Copilot AI Oct 17, 2025

The graph is built with edges from a snapshot version to its previous version (current -> previous). To walk 'forward' along the chain from a previous snapshot to its dependent, you must use predecessors(), not successors(). Using successors() here will either return empty or the wrong node, breaking version resolution. Replace successors(...) with predecessors(...), and update the error messages accordingly.
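
To make the direction terminology concrete, here is a tiny stdlib-only directed graph. It assumes, as the comment above states, that edges run from a snapshot version to the version it is based on (current -> previous); whether the PR's graph is actually built that way is not verified here. The class `GraphDirectionSketch` and the node names are hypothetical. The terms match Guava's usage: a node's successors are reached by its outgoing edges, and its predecessors by its incoming edges.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a directed graph with both adjacency directions indexed.
class GraphDirectionSketch {
  private final Map<String, Set<String>> successors = new HashMap<>();
  private final Map<String, Set<String>> predecessors = new HashMap<>();

  // Adds a directed edge from -> to.
  void putEdge(String from, String to) {
    successors.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
    predecessors.computeIfAbsent(to, k -> new LinkedHashSet<>()).add(from);
  }

  // Nodes reachable by following an outgoing edge of the given node.
  Set<String> successors(String node) {
    return successors.getOrDefault(node, Collections.emptySet());
  }

  // Nodes that have an edge pointing at the given node.
  Set<String> predecessors(String node) {
    return predecessors.getOrDefault(node, Collections.emptySet());
  }

  public static void main(String[] args) {
    GraphDirectionSketch graph = new GraphDirectionSketch();
    // Assumed direction: current -> previous.
    graph.putEdge("snap2", "snap1");
    graph.putEdge("snap3", "snap2");
    System.out.println(graph.successors("snap2"));   // [snap1]
    System.out.println(graph.predecessors("snap2")); // [snap3]
  }
}
```

Under that assumption, walking from an older snapshot toward the snapshots that depend on it means following `predecessors`, and the in-degree of a node counts its dependents.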

Comment on lines +570 to +578

    private WritableOmSnapshotLocalDataProvider(UUID snapshotId) throws IOException {
      super(snapshotId, false);
      fullLock.readLock().lock();
    }

    private WritableOmSnapshotLocalDataProvider(UUID snapshotId, UUID snapshotIdToBeResolved) throws IOException {
      super(snapshotId, false, null, snapshotIdToBeResolved, true);
      fullLock.readLock().lock();
    }

Copilot AI Oct 17, 2025

WritableOmSnapshotLocalDataProvider acquires fullLock.readLock() while performing mutations (commit/upsert) to versionNodeMap/localDataGraph. This can race with readers and other writers. Use fullLock.writeLock() for writes and hold fullLock.readLock() in ReadableOmSnapshotLocalDataProvider during graph reads. Specifically: change readLock() to writeLock() in these constructors and in close(); and acquire/release fullLock.readLock() in ReadableOmSnapshotLocalDataProvider to protect versionNodeMap/localDataGraph access.

    @Override
    public void close() throws IOException {
      super.close();
      fullLock.readLock().unlock();

Copilot AI Oct 17, 2025

WritableOmSnapshotLocalDataProvider acquires fullLock.readLock() while performing mutations (commit/upsert) to versionNodeMap/localDataGraph. This can race with readers and other writers. Use fullLock.writeLock() for writes and hold fullLock.readLock() in ReadableOmSnapshotLocalDataProvider during graph reads. Specifically: change readLock() to writeLock() in these constructors and in close(); and acquire/release fullLock.readLock() in ReadableOmSnapshotLocalDataProvider to protect versionNodeMap/localDataGraph access.

Suggested change
- fullLock.readLock().unlock();
+ fullLock.writeLock().unlock();

Contributor Author

We need a readLock here since we already acquire locks on the individual snapshots. This lock guards against a race condition between bootstrap and other creates.
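
The design described in this reply can be sketched as follows: each update takes an exclusive lock on its own snapshot plus a shared (read) lock on a global lock, while a bootstrap-style operation takes the global lock exclusively. This is an illustration only. `SharedFullLockSketch`, `updateSnapshot`, and `bootstrap` are hypothetical names, and plain `ReentrantLock` objects stand in for the PR's per-snapshot hierarchical locks.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: a global read-write lock combined with per-snapshot locks.
class SharedFullLockSketch {
  final ReentrantReadWriteLock fullLock = new ReentrantReadWriteLock();
  private final ConcurrentHashMap<UUID, ReentrantLock> snapshotLocks = new ConcurrentHashMap<>();

  // Updates to different snapshots can run concurrently: each holds its own
  // snapshot lock and only the shared side of fullLock.
  void updateSnapshot(UUID snapshotId, Runnable update) {
    ReentrantLock snapshotLock = snapshotLocks.computeIfAbsent(snapshotId, id -> new ReentrantLock());
    snapshotLock.lock();
    try {
      fullLock.readLock().lock();
      try {
        update.run();
      } finally {
        fullLock.readLock().unlock();
      }
    } finally {
      snapshotLock.unlock();
    }
  }

  // A bootstrap-style operation excludes every in-flight update by taking
  // the exclusive side of fullLock.
  void bootstrap(Runnable rebuild) {
    fullLock.writeLock().lock();
    try {
      rebuild.run();
    } finally {
      fullLock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    SharedFullLockSketch sketch = new SharedFullLockSketch();
    sketch.updateSnapshot(UUID.randomUUID(),
        () -> System.out.println("shared holders: " + sketch.fullLock.getReadLockCount()));
    sketch.bootstrap(
        () -> System.out.println("exclusive: " + sketch.fullLock.isWriteLocked()));
  }
}
```

In this arrangement the shared lock does not by itself protect any structure that two updates mutate in common; that still needs its own synchronization, which is the point raised in the review thread further down.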

}

/**
* Intializer the snapshot local data by acquiring the lock on the snapshot and also acquires a read lock on the

Copilot AI Oct 17, 2025

Corrected spelling of 'Intializer' to 'Initializes'.

Suggested change
- * Intializer the snapshot local data by acquiring the lock on the snapshot and also acquires a read lock on the
+ * Initializes the snapshot local data by acquiring the lock on the snapshot and also acquires a read lock on the

"key1", "key2")), 3);

IOException ex = assertThrows(IOException.class, omSnapshotLocalDataProvider::commit);
System.out.println(ex.getMessage());

Copilot AI Oct 17, 2025

Avoid printing to stdout in unit tests; it adds noise to test logs. Please remove the System.out.println(...) line.

Suggested change
- System.out.println(ex.getMessage());

Contributor Author

done

Change-Id: I165f00132548acf920b8fb9d7530a6314366797d
Change-Id: I3df543d896463f24ba3b69fce1b2f655af612dc6
Change-Id: I94cf4b82b2b620f480e2d1e01e6d94a6679d974e
Change-Id: I661e61e04031c1bcd537024e0a0859a6d6aeaffd
Change-Id: I59cab67d93f0359bca54c0c8119f5018167b7d1a
Change-Id: I950b4d0cc45a74369b8efa36deaf947db4cb35bc
Change-Id: Ib21e10e8fdf518928f238c9daae6672d679b7ae4
Change-Id: I37d2f5c07f405f3069a4cb99881d5d1e67110e79
Change-Id: I119e8cede140c755d3cd09c0a56234ff4906be98
Change-Id: I6bbcdac1acb389ee3c08efb37676cccdb3d783c6
@@ -75,6 +93,7 @@ public void computeAndSetChecksum(Yaml yaml, OmSnapshotLocalData data) throws IO
}
};
this.versionNodeMap = new HashMap<>();

Contributor

Is ConcurrentHashMap better?

Contributor Author

We need not use ConcurrentHashMap, since the graph is not concurrent and can only be updated inside a synchronized block. As you can see, both the Map and the graph data structure are only updated in the upsertNode function, which is synchronized.

Contributor

If we don't use ConcurrentHashMap, all read paths also need to be synchronized; otherwise other threads may not see the result of a write.

Contributor Author

Why so? We always acquire a read lock or write lock before accessing the map. We cannot access this in a thread-safe way anyhow without acquiring the lock.

Contributor Author

We can read in parallel when there is no write lock involved

Contributor

locking is good but lock alone may not be sufficient.

see Java memory model. related: https://stackoverflow.com/a/15794355

Contributor Author

Yeah, I have synchronized reads and writes on concurrentHashMap and localDataGraph using an internalReadWriteLock.
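
The visibility point debated in this thread can be illustrated with a plain `HashMap` guarded on both the read and the write path by the same `ReadWriteLock`. This is a generic sketch, not the PR's implementation; `GuardedMapSketch`, `upsert`, and `get` are hypothetical names. The `ReadWriteLock` contract guarantees that a thread acquiring the read lock sees all updates made before the previous release of the write lock, so no separate concurrent collection is needed as long as every access goes through the lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: every access to the map goes through the same lock.
class GuardedMapSketch {
  private final Map<String, Integer> versions = new HashMap<>();
  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  // Writers take the exclusive lock.
  void upsert(String snapshotId, int version) {
    lock.writeLock().lock();
    try {
      versions.put(snapshotId, version);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Readers take the shared lock; several readers may proceed in parallel,
  // and each sees the writes published by the last write-lock release.
  Integer get(String snapshotId) {
    lock.readLock().lock();
    try {
      return versions.get(snapshotId);
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    GuardedMapSketch map = new GuardedMapSketch();
    Thread writer = new Thread(() -> map.upsert("snap1", 3));
    writer.start();
    writer.join();
    System.out.println(map.get("snap1"));
  }
}
```

A read that bypasses the lock would lose this guarantee, which is why the reviewer asks for all read paths to be covered.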

Change-Id: I36c48585d17042491f7dc61a84845c999802d1fa
private HierarchicalResourceLockManager locks;

public OmSnapshotLocalDataManager(OMMetadataManager omMetadataManager) throws IOException {
this.localDataGraph = GraphBuilder.directed().build();

Contributor

The same applies to localDataGraph

Contributor Author

Unfortunately, there is no library with a concurrent graph data structure implementation. We would have to implement one, and I feel that might be overkill.

Contributor

I agree.

Just make sure all r/w to this graph are in synchronized then we should be good.

Contributor Author

@swamirishi swamirishi Oct 28, 2025

I have made all writes and reads synchronized by using a readWrite lock

Change-Id: I2f7cb5ec772d2d2de99e8d3cba8bdc34e3f36efd
@swamirishi swamirishi requested a review from sadanand48 October 27, 2025 15:33
Change-Id: Ida628ebab658b2a31fe5d654f19035e96f3163ac
Change-Id: I960fd1d25e0f323d56f5e88ffb03261d777d9021
@smengcl smengcl requested a review from Copilot October 29, 2025 03:48
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.


}

/**
* Intializes the snapshot local data by acquiring the lock on the snapshot and also acquires a read lock on the

Copilot AI Oct 29, 2025

Corrected spelling of 'Intializes' to 'Initializes'.

Suggested change
- * Intializes the snapshot local data by acquiring the lock on the snapshot and also acquires a read lock on the
+ * Initializes the snapshot local data by acquiring the lock on the snapshot and also acquires a read lock on the

* Depending on read or write locks are acquired on the snapshotId and read lock is acquired on the previous
* snapshot. Once the instance is closed the read lock on previous snapshot is released followed by releasing the
* lock on the snapshotId.
* @param read

Copilot AI Oct 29, 2025

The @param 'read' lacks a description. Add a description such as '@param read if true, acquire read lock; if false, acquire write lock'.

Suggested change
- * @param read
+ * @param read if true, acquire read lock; if false, acquire write lock

    if (versionNode != null && localDataGraph.inDegree(versionNode) != 0) {
      Set<LocalDataVersionNode> versionNodes = localDataGraph.predecessors(versionNode);
      throw new IOException(String.format("Cannot remove Snapshot %s with version : %d since it still has " +
          "predecessors : %s", snapshotId, version, versionNodes));

Copilot AI Oct 29, 2025

The error message uses 'predecessors' when referring to nodes that depend on this version. Based on the logic checking inDegree, these are actually successors (dependents). Consider changing 'predecessors' to 'successors' or 'dependents' for clarity.

Suggested change
- "predecessors : %s", snapshotId, version, versionNodes));
+ "successors : %s", snapshotId, version, versionNodes));

@@ -274,7 +279,7 @@ public OmSnapshotLocalData copyObject() {
* maintain immutability.
*/
public static class VersionMeta implements CopyObject<VersionMeta> {

Copilot AI Oct 29, 2025

The field previousSnapshotVersion is changed from final to mutable without a clear design comment explaining why mutability is required. Consider adding a comment explaining the use case for mutation (appears to be for version resolution during lock acquisition).

Suggested change
- public static class VersionMeta implements CopyObject<VersionMeta> {
+ public static class VersionMeta implements CopyObject<VersionMeta> {
+   /**
+    * The version of the previous snapshot. This field is mutable to allow
+    * version resolution during lock acquisition and other update scenarios.
+    * Although the class is generally intended to be immutable, mutability
+    * of this field is required for certain internal operations where the
+    * previous snapshot version may need to be updated after object creation.
+    */

Contributor

@smengcl smengcl left a comment

lgtm

Change-Id: I1bb9832e7e8c40deeccb9d0868eaf5772f39b7f9
@swamirishi
Contributor Author

Thank you @smengcl & @jojochuang for reviewing the PR

@swamirishi swamirishi merged commit b39bac0 into apache:master Oct 29, 2025
43 checks passed
@adoroszlai
Contributor

@swamirishi please set fix version when resolving Jira issue after PR merge

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

5 participants