@swamirishi
Contributor

@swamirishi swamirishi commented Nov 26, 2025

What changes were proposed in this pull request?

A snapshot handle held in the snapshot cache can be reused. Even after the snapshot content lock is taken, a RocksDB handle may still be present in the snapshot cache, and after the atomic switch that old DB handle could still be reused for writes, so those writes would be missing from the newly switched DB. Hence, while performing the atomic switch, a snapshot cache lock should be acquired to ensure that the next DB handle always comes from the new RocksDB instance. Deletion of the older directories should therefore happen under the snapshot content lock.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14015

How was this patch tested?

Unit tests for the Defrag service are being worked on in a separate JIRA.

…under a SNAPSHOT_DB_LOCK

Change-Id: I69ab849dbace2ee1eb0baf4cc508adaa24453e9b
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses a concurrency issue in the snapshot defragmentation service by ensuring that the deletion of old snapshot checkpoint directories occurs under the snapshot content lock. The change prevents a race condition where cached RocksDB handles could be reused for writes after an atomic snapshot database switch, potentially causing writes to be directed to the old (soon-to-be-deleted) database instead of the new one.

  • Moved deleteSnapshotCheckpointDirectories call inside the snapshot content lock scope
  • Ensures proper sequencing of atomic DB switch and directory cleanup operations
  • Maintains proper exception handling with the lock released in the finally block
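The sequencing described in these bullets can be sketched as follows. This is a minimal illustration, not the Ozone implementation: the class `DefragSwitchSketch`, the `ReentrantLock` standing in for the snapshot content lock, and the helper methods are all hypothetical. It only shows the switch and the directory cleanup running inside one lock scope, with the lock released in `finally`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the defrag flow; not the actual Ozone classes. */
class DefragSwitchSketch {
  // Assumed stand-in for the snapshot content lock.
  private final ReentrantLock snapshotContentLock = new ReentrantLock();
  // Records each step and whether the lock was held when it ran.
  final List<String> events = new ArrayList<>();
  private int currentVersion = 0;

  // Stand-in for atomicSwitchSnapshotDB: bumps the version, returns the previous one.
  private int atomicSwitch() {
    int previous = currentVersion;
    currentVersion = previous + 1;
    events.add("switch locked=" + snapshotContentLock.isHeldByCurrentThread());
    return previous;
  }

  // Stand-in for deleteSnapshotCheckpointDirectories: only records the call.
  private void deleteOldDirs(int previousVersion) {
    events.add("delete v" + previousVersion
        + " locked=" + snapshotContentLock.isHeldByCurrentThread());
  }

  void defrag() {
    snapshotContentLock.lock();
    try {
      // Both steps run inside the same lock scope.
      int previousVersion = atomicSwitch();
      deleteOldDirs(previousVersion);
    } finally {
      // Released even if the switch or the cleanup throws.
      snapshotContentLock.unlock();
    }
  }

  boolean isLocked() {
    return snapshotContentLock.isLocked();
  }
}
```

Calling `new DefragSwitchSketch().defrag()` records both steps as having run while the lock was held, and the lock is free again afterwards.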

checkpointMetadataManager = null;
// Switch the snapshot DB location to the new version.
previousVersion = atomicSwitchSnapshotDB(snapshotId, checkpointLocation);
omSnapshotManager.deleteSnapshotCheckpointDirectories(snapshotId, previousVersion);
Contributor

Question -- if deleteSnapshotCheckpointDirectories() here needs to be protected inside a snapshot content lock, do we need to have snapshot content lock too in OMSnapshotPurgeResponse? deleteSnapshotCheckpointDirectories() is invoked there too.

Contributor Author

No, we don't need the SnapshotContentLock in purge, as the snapshot is already going to be deleted. deleteSnapshotCheckpointDirectories would acquire the snapshot DB lock, ensure all handles for the existing DB are released, and evict the instance from the cache altogether. This ensures that, before another thread picks up the snapshot for writing, it always picks the newest instance that we have moved.
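The eviction argument above can be illustrated with a small sketch. Everything here is hypothetical (the class `SnapshotCacheSketch`, the string "handles", the `ReentrantLock` standing in for the snapshot DB lock); it is not the Ozone snapshot cache. It only shows why a cached handle keeps pointing at the old location until it is evicted under the lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical cache sketch; handles are modeled as plain strings. */
class SnapshotCacheSketch {
  // Assumed stand-in for the snapshot DB lock.
  private final ReentrantLock snapshotDbLock = new ReentrantLock();
  private final Map<String, String> cache = new HashMap<>();
  private final Map<String, String> currentLocation = new HashMap<>();

  // Models the atomic switch: points the snapshot at a new DB location.
  void setLocation(String snapshotId, String location) {
    snapshotDbLock.lock();
    try {
      currentLocation.put(snapshotId, location);
    } finally {
      snapshotDbLock.unlock();
    }
  }

  // Returns the cached handle, opening one from the current location if absent.
  String get(String snapshotId) {
    snapshotDbLock.lock();
    try {
      return cache.computeIfAbsent(
          snapshotId, id -> "handle:" + currentLocation.get(id));
    } finally {
      snapshotDbLock.unlock();
    }
  }

  // Models the cleanup step: evicts the cached handle under the DB lock.
  // Removal of the old directories would happen here in a real system.
  void deleteCheckpointDirectories(String snapshotId) {
    snapshotDbLock.lock();
    try {
      cache.remove(snapshotId);
    } finally {
      snapshotDbLock.unlock();
    }
  }
}
```

In this sketch, a `get` issued after `setLocation` but before `deleteCheckpointDirectories` still returns the stale handle; only after the eviction does the next `get` open a handle from the new location.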

Contributor

ok got it.

@jojochuang jojochuang marked this pull request as ready for review November 26, 2025 21:35
Contributor

@jojochuang jojochuang left a comment

lgtm

checkpointMetadataManager = null;
// Switch the snapshot DB location to the new version.
previousVersion = atomicSwitchSnapshotDB(snapshotId, checkpointLocation);
omSnapshotManager.deleteSnapshotCheckpointDirectories(snapshotId, previousVersion);
Contributor

Title doesn't correctly reflect the change?

Contributor

@smengcl smengcl Nov 26, 2025

The title should be:

Delete older snapshot checkpoint dirs under the snapshot content lock

Contributor Author

done

@swamirishi swamirishi changed the title from "HDDS-14015. Atomic Switch of Snapshot db after defrag should be done under a SNAPSHOT_DB_LOCK" to "HDDS-14015. Delete older snapshot checkpoint dirs under the snapshot content lock" Nov 26, 2025
@swamirishi swamirishi merged commit 0a9df7b into apache:master Nov 26, 2025
63 checks passed
@swamirishi
Contributor Author

Thank you @jojochuang and @smengcl for reviewing the patch.

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517
