-
Notifications
You must be signed in to change notification settings - Fork 587
HDDS-14015. Delete older snapshot checkpoint dirs under the snapshot content lock #9380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…under a SNAPSHOT_DB_LOCK Change-Id: I69ab849dbace2ee1eb0baf4cc508adaa24453e9b
2c30454 to
36ca7df
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR addresses a concurrency issue in the snapshot defragmentation service by ensuring that the deletion of old snapshot checkpoint directories occurs under the snapshot content lock. The change prevents a race condition where cached RocksDB handles could be reused for writes after an atomic snapshot database switch, potentially causing writes to be directed to the old (soon-to-be-deleted) database instead of the new one.
- Moved
deleteSnapshotCheckpointDirectoriescall inside the snapshot content lock scope - Ensures proper sequencing of atomic DB switch and directory cleanup operations
- Maintains proper exception handling with the lock released in the finally block
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
| checkpointMetadataManager = null; | ||
| // Switch the snapshot DB location to the new version. | ||
| previousVersion = atomicSwitchSnapshotDB(snapshotId, checkpointLocation); | ||
| omSnapshotManager.deleteSnapshotCheckpointDirectories(snapshotId, previousVersion); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question -- if deleteSnapshotCheckpointDirectories() here needs to be protected inside a snapshot content lock, do we need to have snapshot content lock too in OMSnapshotPurgeResponse? deleteSnapshotCheckpointDirectories() is invoked there too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No we don't need SnapshotContentLock in purge as the snapshot is already going to deleted. DeleteSnapshotCheckpointDirectories would acquire snapshot db lock and ensure all handles for existing db and evict the instance from the cache altogether. This would ensure before another thread picks up the snapshot for writing we always pick the newest instance that we have moved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok got it.
jojochuang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
| checkpointMetadataManager = null; | ||
| // Switch the snapshot DB location to the new version. | ||
| previousVersion = atomicSwitchSnapshotDB(snapshotId, checkpointLocation); | ||
| omSnapshotManager.deleteSnapshotCheckpointDirectories(snapshotId, previousVersion); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Title doesn't correctly reflect the change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The title should be:
Delete older snapshot checkpoint dirs under the snapshot content lock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
thank you @jojochuang and @smengcl fir reviewing the patch |
What changes were proposed in this pull request?
A snapshot handle present inside a snapshot cache can be reused. Once the snapshot content lock is taken a rocksdb handle could be still present in the snapshot cache and after the atomic switch the old db handle could still get reused for writes thus missing potential writes on the newly switched db. Hence while performing the atomic switch a snapshot cache lock should be acquired to ensure the next db handle would always be from the new rocksdb handle. So deletion of older directories should be under the snapshot content lock.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14015
How was this patch tested?
Working on all unit tests for Defrag service in a separate jira